Abstract

In the context of language recognition, we demonstrate the superiority of streaming property testers
over streaming algorithms and property testers, when they are not combined. Initiated by Feigenbaum
et al., a streaming property tester is a streaming algorithm recognizing a language under the property
testing approximation: it must distinguish inputs of the language from those that are ε-far from it, while
using the smallest possible memory (rather than limiting its number of input queries).

Our main result is a streaming ε-property tester for visibly pushdown languages (VPL) with one-sided
error using memory space poly((log n)/ε).

This construction relies on a (non-streaming) property tester for weighted regular languages based
on a previous tester by Alon et al. We provide a simple application of this tester for streaming testing
special cases of instances of VPL that are already hard for both streaming algorithms and property testers.

Our main algorithm first performs an original simulation of visibly pushdown automata using
a stack of small height whose items may have linear size. In a second step, those items are replaced by
small sketches. Those sketches rely on a notion of suffix-sampling that we introduce. This sampling is the
key idea connecting our streaming tester algorithm to property testers.

1 Introduction


Visibly pushdown languages (VPL) play an important role in formal languages with crucial applications
for databases and program analysis. In the context of structured documents, they are closely related to
regular languages of unranked trees as captured by hedge automata. A well-known result [?] states that,
when the tree is given by its depth-first traversal, such automata correspond to visibly pushdown automata
(VPA) (see e.g. [?] for an overview of automata and logic for unranked trees). In databases, this word
encoding of trees is known as XML encoding, where DTD specifications are examples of often considered
subclasses of VPL. In program analysis, VPA also capture natural properties of execution traces of recursive
finite-state programs, including non-regular ones such as those with pre- and post-conditions as expressed in
the temporal logic of calls and returns (CaRet) [5, 4].

Historically, VPL have gone by several names, such as input-driven languages or, more recently, languages of
nested words. Intuitively, a VPA is a pushdown automaton whose actions on the stack (push, pop or nothing)
are solely decided by the currently read symbol. As a consequence, symbols can be partitioned into three
groups: push, pop and neutral symbols. The complexity of VPL recognition has been addressed in various
computational models. The first results go back to the design of logarithmic space algorithms [?] as well
as NC^1-circuits [13]. Later on, other models motivated by the context of massive data were considered, such
as streaming algorithms and property testers (described below).

Streaming algorithms (see e.g. [?]) have only sequential access to their input, on which they can
perform a single pass, or sometimes a small number of additional passes. The size of their internal (random
access) memory is the crucial complexity parameter, which should be sublinear in the input size, and even
polylogarithmic if possible. The area of streaming algorithms has experienced tremendous growth in many
applications since the late 1990s. The analysis of Internet traffic, in which traffic logs are queried, was
one of their first applications. Nowadays, they have found applications with big data, notably to test graph
properties, and more recently in language recognition on very large inputs. The streaming complexity of
language recognition was first considered for languages that arise in the context of memory checking
[?], of databases [29, 28], and later on for formal languages [?, 17]. However, even for simple VPL,
any randomized streaming algorithm with p passes requires memory Ω(n/p), where n is the input size [?].

As opposed to streaming algorithms, (standard) property testers [?, 10, ?] have random access to their
input but in the query model. They must query each piece of the input they need to access. They should
sample only a sublinear fraction of their input, and ideally make a constant number of queries. In order
to make the task of verification possible, decision problems need to be approximated as follows. Given a
distance on words, an ε-tester for a language L distinguishes with high probability the words in L from those
ε-far from L, using as few queries as possible. Property testing of regular languages was first considered for
the Hamming distance [1]. When the distance allows sufficient modifications of the input, such as moves of
arbitrarily large factors, it has been shown that any context-free language becomes testable with a constant
number of queries [20, 15]. However, for more realistic distances, property testers for simple languages
require a large number of queries, especially if they have one-sided error only. For example, the complexity
of an ε-tester for well-parenthesized expressions with two types of parentheses is between Ω(n^{1/11}) and
O(n^{2/3}) [26], and it becomes linear, even for one type of parentheses, if we require one-sided error [?]. The
difficulty of testing regular tree languages was also addressed when the tester can directly query the tree
structure [?, 25].

Faced with the intrinsic hardness of VPL in both streaming and property testing, we study the complexity
of streaming property testers of formal languages, a model of algorithms combining both approaches.
Such testers were historically introduced for testing specific problems (groupedness) [14] relevant for
network data. They were later studied in the context of testing the insert/extract-sequence of a priority-queue
structure [?]. We extend these studies to classes of problems. A streaming property tester is a streaming
algorithm recognizing a language under the property testing approximation: it must distinguish inputs of
the language from those that are ε-far from it, while using the smallest possible memory (rather than limiting
its number of input queries). Such an algorithm can simulate any standard non-adaptive property tester.
Moreover, we will see that, using its full scan of the input, it can construct better sketches than in the query
model.

In this paper, we consider a natural notion of distance for VPL, the balanced-edit distance, which refines
the edit distance on balanced words (where for each push symbol there is a matching pop symbol at the
same height of the stack, and conversely). It can be interpreted as the edit distance on trees when trees are
encoded as balanced words. Neutral symbols can be deleted/inserted, but any push symbol can only be
deleted/inserted together with its matching pop symbol. Since our distance is larger than the standard edit
distance, our testers are also valid for that distance.

In Section 3, we first design an exact algorithm that maintains a small stack but whose items can be of
linear size, as opposed to the standard simulation of a pushdown automaton, which usually has a stack of
possibly linear size but with constant size items. In our algorithm, stack items are prefixes of some peaks
(which we call unfinished peaks), where a peak is a balanced factor whose push symbols all appear before
the first pop symbol. Our algorithm compresses an unfinished peak u = u_+ v_− when it is followed by a long
enough sequence. More precisely, the compression applies to the peak v_+ v_− obtained by disregarding part
of the prefix of the push sequence u_+. Those peaks are then inductively replaced, and therefore compressed, by
the state-transition relation they define on the given automaton. The relation is then considered as a single
symbol whose weight is the size of the peak it represents. In addition, to maintain a stack of logarithmic
depth, one of the crucial properties of our algorithm (Proposition 3.3) is rewriting the input word as a peak
formed by potentially a linear number of intermediate peaks, but with only a logarithmic number of nested
peaks.

In Section 4, for the case of a single peak, we show how to sketch the current unfinished peak of
our algorithm. The simplicity of those instances will let us highlight our first idea. Moreover, they are
already expressive enough to demonstrate the superiority of streaming testers over streaming
algorithms and property testers, when they are not combined. We first reduce the problem of streaming
testing such instances to the problem of testing regular languages in the standard model of property testing
(Theorem 4.9). Since our reduction induces weights on the letters of the new input word, we need a tester
for weighted regular languages (Theorem A.2). Such a property tester has previously been devised in [25],
extending constructions for unweighted regular languages [1, 24]. However, we consider a slightly simpler
construction that could be of independent interest. As a consequence we get a streaming property tester with
polylogarithmic memory for recognizing peak instances of any given VPL (Theorem 4.10), a task already
hard for streaming algorithms and property testers (Fact 4.1).

In Section 5, we construct our main tester for a VPL L given by some VPA. For this we introduce a
more involved notion of sketches made of a polylogarithmic number of samples. They are based on a new
notion of suffix sampling (Definition 5.1). This sampling consists of a decomposition of the string into
an increasing sequence of suffixes, whose weights increase geometrically. Such a decomposition can be
computed online on a data stream, and one can maintain samples in each suffix of the decomposition using
a standard reservoir sampling. This suffix decomposition will allow us to simulate an appropriate sampling
on the peaks we compress, even if we do not yet know where they start. Our sampling can be used to
perform an approximate computation of the compressed relation by our new property tester of weighted
regular languages which we also used for single peaks. We first establish a stability result, which basically
states that we can assume that our algorithm knows in advance where the peak it will compress starts
(Lemma 5.6). Then we prove the robustness of our algorithm: words that are ε-far from L are rejected
with high probability (Lemma 5.8). As a consequence, we get a one-pass streaming ε-tester for L with
one-sided error η and memory space polynomial in (log n)/ε, with an additional dependence on log(1/η)
and on m, where m is the number of states of a VPA recognizing L (Theorem 5.4).
Algorithm 1: Reservoir Sampling

Input: Data stream u, integer parameter t ≥ 1
Data structure:
    σ ← 0                // Current weight of the processed stream
    S ← empty multiset   // Multiset of sampled letters
Code:
    i ← 1, a ← Next(u), σ ← |a|
    S ← t copies of a
    While u not finished
        i ← i + 1, a ← Next(u), σ ← σ + |a|
        For each b ∈ S
            Replace b by a with probability |a|/σ
    Output S
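
As an illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `reservoir_sample`, the encoding of the stream as (letter, weight) pairs, and the optional `rng` argument are assumptions made for this example.

```python
import random

def reservoir_sample(stream, t, rng=random):
    """Weighted reservoir sampling in the spirit of Algorithm 1.

    `stream` yields (letter, weight) pairs with positive integer weights
    (an assumed encoding). Returns a list of t sampled letters, or an
    empty list if the stream is empty.
    """
    total = 0        # sigma: weight of the processed part of the stream
    samples = None   # S: multiset of sampled letters
    for letter, weight in stream:
        total += weight
        if samples is None:
            samples = [letter] * t          # first letter: t copies
            continue
        for k in range(t):
            # Each slot is independently replaced with probability |a|/sigma.
            if rng.random() < weight / total:
                samples[k] = letter
    return samples if samples is not None else []
```

Each of the t slots then holds a letter chosen with probability proportional to its weight, and the memory used is t letters plus one counter, independently of the stream length.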


2 Definitions and Preliminaries 

Let N* be the set of positive integers, and for any integer n ∈ N*, let [n] = {1, 2, ..., n}. A t-subset of a
set S is any subset of S of size t. For a finite alphabet Σ we denote the set of finite words over Σ by Σ*.
For a word u = u(1)u(2)···u(n), we call n the length of u, and u(i) the ith letter in u. We write u[i, j] for
the factor u(i)u(i + 1)···u(j) of u. When we mention letters and factors of u we implicitly also mention
their positions in u. We say that v is a sub-factor of v', denoted v ≤ v', if v = u[i, j] and v' = u[i', j'] with
[i, j] ⊆ [i', j']. Similarly we say that v = v' if [i, j] = [i', j']. If i ≤ i' ≤ j ≤ j' we say that the overlap of v
and v' is u[i', j]. If v is a sub-factor of v' then the overlap of v and v' is v. Given two multisets of factors S
and S', we say that S ≤ S' if for each factor v ∈ S there is a corresponding factor v' ∈ S' such that v ≤ v'.

Weighted Words and Sampling. A weight function on a word u with n letters is a function λ : [n] → N*
on the letters of u, whose value λ(i) is called the weight of u(i). A weighted word over Σ is a pair (u, λ)
where u ∈ Σ* and λ is a weight function on u. We define |u(i)| = λ(i) and |u[i, j]| = λ(i) + λ(i + 1) +
... + λ(j). The length of (u, λ) is the length of u. For simplicity, we will denote by u the weighted word
(u, λ). Weighted letters will be used to substitute factors of the same weight. Therefore, restrictions may exist
on available weights for a given letter.

Our algorithms will be based on a sampling of small factors according to their weights. We introduce
a very specific notion adapted to our setting. For a weighted word u, we denote by k-factor sampling on u
the sampling over factors u[i, i + l] with probability |u(i)|/|u|, where l ≥ 0 is the smallest integer such that
|u[i, i + l]| ≥ k if it exists; otherwise l is such that u(i + l) is the last letter of u. More generally, we call
such a factor a k-factor. For the special case of k = 1, we call this sampling a letter sampling on u. Observe that both
of them can be implemented using a standard reservoir sampling (see Algorithm 1 for letter sampling).
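
To make the definition concrete, here is a small offline Python sketch of k-factor sampling. It assumes the whole weight sequence is available as a list and uses 0-based positions; the names `k_factor` and `sample_k_factor` are hypothetical, and the streaming version based on reservoir sampling is not shown.

```python
import random

def k_factor(weights, i, k):
    """Return (i, j), 0-based and inclusive: the k-factor starting at i.

    j is the smallest index with weights[i] + ... + weights[j] >= k,
    or the last index of the word if no such index exists.
    """
    total = 0
    for j in range(i, len(weights)):
        total += weights[j]
        if total >= k:
            return (i, j)
    return (i, len(weights) - 1)

def sample_k_factor(weights, k, rng=random):
    """Pick the start i with probability |u(i)|/|u|, then extend to a k-factor."""
    start = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return k_factor(weights, start, k)
```

With k = 1 every sampled factor is a single letter, which is the letter sampling of the text.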

Even if our algorithm will require several samples from a k-factor sampling, we will often only be
able to simulate this sampling by sampling either larger factors, more factors, or both. Let W_1 be a
sampler producing a random multiset S_1 of factors of some given weighted word u. Then W_2
over-samples W_1 if it produces a random multiset S_2 of factors of u such that for each factor v, we have
Pr(∃v' ∈ S_2 such that v is a factor of v') ≥ Pr(∃v' ∈ S_1 such that v is a factor of v').

Finite State Automata and Visibly Pushdown Automata. A finite state automaton is a tuple of the form
A = (Q, Σ, Q_in, Q_f, Δ) where Q is a finite set of control states, Σ is a finite input alphabet, Q_in ⊆ Q is a
subset of initial states, Q_f ⊆ Q is a subset of final states and Δ ⊆ Q × Σ × Q is a transition relation. We
write p -u→ q to mean that there is a sequence of transitions in A from p to q while processing u, and we call
(p, q) a u-transition. A word u is accepted if q_in -u→ q_f for some q_in ∈ Q_in and q_f ∈ Q_f. The language
L(A) of A is the set of words accepted by A, and we refer to such a language as a regular language. For
Σ' ⊆ Σ, the Σ'-diameter (or simply diameter when Σ' = Σ) of A is the maximum over all possible pairs
(p, q) ∈ Q × Q of min{|u| : p -u→ q and u ∈ Σ'*}, whenever this minimum is not over an empty set. We say
that A is Σ'-closed when p -u→ q for some u ∈ Σ* if and only if p -u'→ q for some u' ∈ Σ'*.

A pushdown alphabet is a triple (Σ_+, Σ_−, Σ_=) that comprises three disjoint finite alphabets: Σ_+ is a
finite set of push symbols, Σ_− is a finite set of pop symbols, and Σ_= is a finite set of neutral symbols. For any
such triple, let Σ = Σ_+ ∪ Σ_− ∪ Σ_=. Intuitively, a visibly pushdown automaton [27] over (Σ_+, Σ_−, Σ_=) is a
pushdown automaton restricted so that it pushes onto the stack only on reading a push symbol, it pops the stack only
on reading a pop symbol, and it does not modify the stack on reading a neutral symbol. Up to coding, this notion is
similar to that of input-driven pushdown automata [22] and of nested word automata [6, ?].

Definition 2.1 (Visibly pushdown automaton [27]). A visibly pushdown automaton (VPA) over (Σ_+, Σ_−, Σ_=)
is a tuple A = (Q, Σ, Γ, Q_in, Q_f, Δ) where Q is a finite set of states, Q_in ⊆ Q is a set of initial states,
Q_f ⊆ Q is a set of final states, Γ is a finite stack alphabet, and Δ ⊆ (Q × Σ_+ × Q × Γ) ∪ (Q × Σ_− × Γ ×
Q) ∪ (Q × Σ_= × Q) is the transition relation.

To represent stacks we use a special bottom-of-stack symbol ⊥ that is not in Γ. A configuration of a
VPA is a pair (σ, q), where q ∈ Q and σ ∈ ⊥·Γ*. For a ∈ Σ, there is an a-transition from a configuration
(σ, q) to (σ', q'), denoted (σ, q) -a→ (σ', q'), in the following cases:

• If a is a push symbol, then σ' = σγ for some (q, a, q', γ) ∈ Δ, and we write q -a→ (q', push(γ)).

• If a is a pop symbol, then σ = σ'γ for some (q, a, γ, q') ∈ Δ, and we write (q, pop(γ)) -a→ q'.

• If a is a neutral symbol, then σ = σ' and (q, a, q') ∈ Δ, and we write q -a→ q'.

For a finite word u = a_1 ··· a_n ∈ Σ*, if (σ_{i−1}, q_{i−1}) -a_i→ (σ_i, q_i) for every 1 ≤ i ≤ n, we also
write (σ_0, q_0) -u→ (σ_n, q_n). The word u is accepted by a VPA if there is (p, q) ∈ Q_in × Q_f such that
(⊥, p) -u→ (⊥, q). The language L(A) of A is the set of words accepted by A, and we refer to such a
language as a visibly pushdown language (VPL).
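
The acceptance condition above can be illustrated by a direct (non-streaming) simulation. The following Python sketch is only an illustration under assumed encodings: transitions are stored in dictionaries, the stack is a tuple that omits the bottom symbol ⊥, and the function name `vpa_accepts` and the example automaton `dyck2` are hypothetical. It tracks all reachable configurations, so it may use far more memory than the algorithms of this paper.

```python
def vpa_accepts(word, vpa):
    """Nondeterministic simulation of a VPA on `word`.

    Assumed encoding of `vpa`:
      vpa["push"][(q, a)]        -> set of (q2, gamma)
      vpa["pop"][(q, a, gamma)]  -> set of q2
      vpa["neutral"][(q, a)]     -> set of q2
      vpa["initial"], vpa["final"] -> sets of states
    """
    configs = {((), q) for q in vpa["initial"]}   # (stack above bottom, state)
    for a in word:
        nxt = set()
        for stack, q in configs:
            for q2, gamma in vpa["push"].get((q, a), ()):
                nxt.add((stack + (gamma,), q2))
            if stack:
                for q2 in vpa["pop"].get((q, a, stack[-1]), ()):
                    nxt.add((stack[:-1], q2))
            for q2 in vpa["neutral"].get((q, a), ()):
                nxt.add((stack, q2))
        configs = nxt
    # Acceptance with an empty stack in a final state.
    return any(not stack and q in vpa["final"] for stack, q in configs)

# Example: well-parenthesized words with two parenthesis types and neutral '*'.
dyck2 = {
    "initial": {0}, "final": {0},
    "push": {(0, "("): {(0, "(")}, (0, "["): {(0, "[")}},
    "pop": {(0, ")", "("): {0}, (0, "]", "["): {0}},
    "neutral": {(0, "*"): {0}},
}
```

Because the three symbol classes are disjoint, at most one of the three dictionary lookups can succeed for a given letter.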

At each step, the height of the stack is pre-determined by the prefix of u read so far. The height height(u)
of u ∈ Σ* is the difference between the number of its push symbols and of its pop symbols. A word u is
balanced if height(u) = 0 and height(u[1, i]) ≥ 0 for all i. We also say that a push symbol u(i) matches a
pop symbol u(j) if height(u[i, j]) = 0 and height(u[i, k]) > 0 for all i ≤ k < j. By extension, the height
of u(i) is height(u[1, i − 1]) when u(i) is a push symbol, and height(u[1, i]) otherwise.
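
These notions depend only on the partition of the alphabet, not on the automaton. A minimal Python sketch, assuming the push and pop symbols are given as collections of characters and using 0-based positions (the function names are chosen for this example):

```python
def height(word, push, pop):
    """Number of push symbols minus number of pop symbols."""
    return sum((c in push) - (c in pop) for c in word)

def is_balanced(word, push, pop):
    """Height 0 overall, and no prefix of negative height."""
    h = 0
    for c in word:
        h += (c in push) - (c in pop)
        if h < 0:
            return False
    return h == 0

def matching(word, push, pop):
    """Map each matched push position to the position of its matching pop."""
    pending, match = [], {}
    for i, c in enumerate(word):
        if c in push:
            pending.append(i)
        elif c in pop and pending:
            match[pending.pop()] = i
    return match
```

A single counter suffices for `is_balanced`, which is why checking balancedness on a stream is easy.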

For all balanced words u, the property (σ, p) -u→ (σ, q) does not depend on σ, therefore we simply write
p -u→ q, and say that (p, q) is a u-transition. We also define, similarly to finite automata, the Σ'-diameter of A
(or simply diameter) and the notion of A being Σ'-closed, on balanced words only.

Our model is inherently restricted to input words having no prefix of negative stack height, and we 
defined acceptance with an empty stack. This implies that only balanced words can be accepted. From now 
on, we will always assume that the input is balanced as verifying this in a streaming context is easy. 

Balanced/Standard Edit Distance. The usual distance between words in property testing is the Hamming
distance. In this work, we consider a distance that is easier to manipulate in property testing but still relevant for
most applications, which is the edit distance, that we adapt to weighted words.

Given a word u, we define two possible edit operations: the deletion of a letter in position i with
corresponding cost |u(i)|, and its converse operation, the insertion, where we also select a weight, compatible
with the restrictions on λ, for the new u(i). Then the (standard) edit distance dist(u, v) between two
weighted words u and v is simply defined as the minimum total cost of a sequence of edit operations
changing u to v. Note that all letters that have not been inserted or deleted must keep the same weight. For
a restricted set of letters Σ', we also define dist_{Σ'}(u, v) where the insertions are restricted to letters in Σ'.

We will also consider a restricted version of this distance for balanced words, motivated by our study
of VPL. Similarly, balanced-edit operations can be deletions or insertions of letters, but each deletion of
a push symbol (resp. pop symbol) requires the deletion of the matching pop symbol (resp. push symbol).
Similarly for insertions: if a push (resp. pop) symbol is inserted, then a matching pop (resp. push) symbol
must also be inserted simultaneously. The cost of these operations is the weight of the affected letters, as
with the edit operations. We define the balanced-edit distance bdist(u, v) between two balanced words as
the minimum total cost of a sequence of balanced-edit operations changing u to v. Similarly to dist_{Σ'}(u, v) we define
bdist_{Σ'}(u, v).
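
For intuition, the standard weighted edit distance (insertions and deletions only) can be computed offline by a classical dynamic program. The sketch below is an illustration only: it assumes words are lists of (letter, weight) pairs, ignores the restriction to an alphabet Σ', and does not handle the balanced variant; the name `edit_distance` is chosen for this example.

```python
def edit_distance(u, v):
    """Minimum total weight of insertions/deletions turning u into v.

    u and v are lists of (letter, weight) pairs. A letter of u can be kept
    only if it faces the same letter with the same weight in v.
    """
    n, m = len(u), len(v)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + u[i - 1][1]          # delete u(i)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + v[j - 1][1]          # insert v(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + u[i - 1][1],    # delete u(i)
                       d[i][j - 1] + v[j - 1][1])    # insert v(j)
            if u[i - 1] == v[j - 1]:
                best = min(best, d[i - 1][j - 1])    # keep the letter
            d[i][j] = best
    return d[n][m]
```

Since every balanced-edit sequence is also a sequence of standard edit operations, the value computed here is a lower bound on bdist(u, v) for balanced words.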

When dealing with a visibly pushdown language, we will always use the balanced-edit distance, whereas
we will use the standard-edit distance for regular languages. We also say that u is (ε, Σ')-far from v if
dist_{Σ'}(u, v) > ε|u| or bdist_{Σ'}(u, v) > ε|u|, depending on the context; otherwise we say that u is (ε, Σ')-close
to v. We omit Σ' when Σ' = Σ.

Streaming Property Testers. An ε-tester for a language L accepts all inputs which belong to L with
probability 1 and rejects with high probability all inputs which are ε-far from L, i.e. that are ε-far from any
element of L. In particular, a tester for some given distance is also a tester for any other smaller distance.
Two-sided error testers have also been studied but in this paper we stay with the notion of one-sided testers,
that we adapt to the context of streaming algorithms as in [14].

Definition 2.2 (Streaming property tester). Let ε > 0 and let L be a language. A streaming ε-tester for L
with one-sided error η and memory s(n) is a randomized algorithm A such that, for any input u of length n
given as a data stream:

• If u ∈ L, then A accepts with probability 1;

• If u is ε-far from L, then A rejects with probability at least 1 − η;

• A processes u within a single sequential pass while maintaining a memory space of O(s(n)) bits.

3 Exact Algorithm 

Fix a VPA recognizing some VPL L on Σ = Σ_+ ∪ Σ_− ∪ Σ_=. In this section, we design an exact streaming
algorithm that decides whether an input belongs to L. Algorithm 2 maintains a stack of small height but
whose items can be of linear size. In Section 5, we replace stack items by appropriate small sketches.

3.1 Notations and Algorithm Description 

Call a peak a sequence of push symbols followed by a sequence of pop symbols, with possibly intermediate
neutral symbols, i.e. an element of the language Λ = ∪_{j≥0} ((Σ_=)*·Σ_+)^j · (Σ_=)* · (Σ_−·(Σ_=)*)^j. One can
compress any peak v ∈ Λ by the set R_v = {(p, q) : p -v→ q} of the v-transitions, and consider R_v as a new
neutral symbol with weight |v|. In fact, for the purpose of the analysis of our algorithm, we augment neutral
symbols by many more relations for which A remains Σ-closed. For the rest of the paper, they will be the
only symbols with weight potentially larger than 1.

Definition 3.1. Let Σ_Q be Σ_= augmented by all weighted letters encoding a relation R ⊆ Q × Q such that
for every (p, q) ∈ R there is a balanced word u ∈ Σ* with p -u→ q. Let Λ_Q be Λ where Σ_= is replaced by
Σ_Q.

We then write p -R→ q whenever (p, q) ∈ R, and extend A and L accordingly. Of course, our notion of
distance will be solely based on the initial alphabet Σ.

A general balanced input instance u will consist of many nested peaks. However, we will recursively
replace each factor v ∈ Λ_Q by R_v with weight |v|.

Denote by Prefix(Λ_Q) the language of prefixes of words in Λ_Q. While processing the prefix u[1, i] of
the data stream u, Algorithm 2 maintains a suffix u_0 ∈ Prefix(Λ_Q) of u[1, i], that is an unfinished peak,
with some simplifications of factors v in Λ_Q by their corresponding relation R_v. Therefore u_0 consists of a
sequence of push symbols and neutral symbols possibly followed by a sequence of pop symbols and neutral
symbols. The algorithm also maintains a subset R_temp ⊆ Q × Q that is the set of transitions for the maximal
prefix of u[1, i] in Λ_Q. When the stream is over, the set R_temp is used to decide whether u ∈ L or not.

Algorithm 2: Exact Tester for a VPL

 1 Input: Balanced data stream u
 2 Data structure:
 3   Stack ← empty stack   // Stack of items v with v ∈ Prefix(Λ_Q)
 4   u_0 ← ∅   // u_0 ∈ Prefix(Λ_Q) is a suffix of the processed part u[1, i] of u
 5             // with possibly some factors v ∈ Λ_Q replaced by R_v
 6   R_temp ← {(p, p)}_{p∈Q}   // Set of transitions for the maximal prefix of u[1, i] in Λ_Q
 7 Code:
 8 While u not finished
 9   a ← Next(u)   // Read and process a new symbol a
10   If a ∈ Σ_+ and u_0 has a letter in Σ_−   // u_0·a ∉ Prefix(Λ_Q)
11     Push u_0 on Stack, u_0 ← a
12   Else u_0 ← u_0·a
13   If u_0 is balanced   // u_0 ∈ Λ_Q: compression
14     Compute R_{u_0}, the set of u_0-transitions
15     If Stack = ∅, then R_temp ← R_temp ∘ R_{u_0}, u_0 ← ∅
16     // where ∘ denotes the composition of relations
17     Else Pop v from Stack, u_0 ← v·R_{u_0}
18   Let (v_1·v_2) ← top(Stack) s.t. v_2 is maximal and balanced   // v_2 ∈ Λ_Q
19   If |u_0| > |v_2|/2   // u_0 is big enough and v_2 can be replaced by R_{v_2}
20     Compute R_{v_2}, the set of v_2-transitions, Pop v from Stack, u_0 ← (v_1·R_{v_2})·u_0
21 If (Q_in × Q_f) ∩ R_temp ≠ ∅, Accept; Else Reject   // R_temp = R_u

When a push symbol a comes after a pop sequence, u_0·a is no longer in Prefix(Λ_Q); hence Algorithm 2
puts u_0 on the stack of unfinished peaks (see lines 10 to 11 and Figure 1a) and u_0 is reset to a. In other
situations, it adds a to u_0. In case u_0 becomes a word in Λ_Q (see lines 13 to 17 and Figure 1b), Algorithm 2
computes the set of u_0-transitions R_{u_0} ∈ Σ_Q, and adds R_{u_0} to the previous unfinished peak, which is retrieved
from the top of the stack and becomes the current unfinished peak; in the special case where the stack is empty,
one simply updates the set R_temp by taking its composition with R_{u_0}.
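
The following Python sketch follows the steps of Algorithm 2 on a small example. It is a hedged illustration, not the authors' implementation: the dictionary encoding of the VPA, the triples (kind, payload, weight) used for working letters, and all function names are assumptions made here. Neutral input letters are stored directly as relations, and relations are computed by a brute-force simulation, so no attempt is made at the memory bounds discussed in the paper.

```python
def relation_of(vpa, word):
    """R_word = {(p, q) : p -word-> q} for a balanced working word."""
    rel = set()
    for p in vpa["states"]:
        configs = {((), p)}                      # (stack above bottom, state)
        for kind, x, _ in word:
            nxt = set()
            for stack, q in configs:
                if kind == "+":
                    for q2, g in vpa["push"].get((q, x), ()):
                        nxt.add((stack + (g,), q2))
                elif kind == "-":
                    if stack:
                        for q2 in vpa["pop"].get((q, x, stack[-1]), ()):
                            nxt.add((stack[:-1], q2))
                else:                            # neutral letter: x is a relation
                    nxt.update((stack, q2) for q1, q2 in x if q1 == q)
            configs = nxt
        rel.update((p, q) for stack, q in configs if not stack)
    return frozenset(rel)

def weight(word):
    return sum(w for _, _, w in word)

def height(word):
    return sum((k == "+") - (k == "-") for k, _, _ in word)

def split_max_balanced_suffix(item):
    """Split an unfinished peak v into (v1, v2), v2 its maximal balanced suffix."""
    h, cut = 0, len(item)
    for j in range(len(item) - 1, -1, -1):
        h += (item[j][0] == "-") - (item[j][0] == "+")
        if h < 0:
            break
        if h == 0:
            cut = j
    return item[:cut], item[cut:]

def compose(r, s):
    return frozenset((p, q2) for p, q in r for q1, q2 in s if q == q1)

def exact_tester(vpa, stream):
    stack, u0 = [], []
    r_temp = frozenset((p, p) for p in vpa["states"])
    for a in stream:
        if a in vpa["push_symbols"]:
            letter = ("+", a, 1)
        elif a in vpa["pop_symbols"]:
            letter = ("-", a, 1)
        else:
            letter = ("=", frozenset(vpa["neutral"].get(a, ())), 1)
        if letter[0] == "+" and any(k == "-" for k, _, _ in u0):
            stack.append(u0)                     # lines 10-11
            u0 = [letter]
        else:
            u0 = u0 + [letter]                   # line 12
        if height(u0) == 0:                      # lines 13-17 (input is balanced)
            r = ("=", relation_of(vpa, u0), weight(u0))
            if not stack:
                r_temp, u0 = compose(r_temp, r[1]), []
            else:
                u0 = stack.pop() + [r]
        if stack:                                # lines 18-20
            v1, v2 = split_max_balanced_suffix(stack[-1])
            if weight(u0) > weight(v2) / 2:
                stack.pop()
                u0 = v1 + [("=", relation_of(vpa, v2), weight(v2))] + u0
    return any((p, q) in r_temp for p in vpa["initial"] for q in vpa["final"])

# Example VPA: well-parenthesized words over ( ) [ ] with neutral symbol '*'.
DYCK2 = {
    "states": {0}, "initial": {0}, "final": {0},
    "push_symbols": "([", "pop_symbols": ")]",
    "push": {(0, "("): {(0, "(")}, (0, "["): {(0, "[")}},
    "pop": {(0, ")", "("): {0}, (0, "]", "["): {0}},
    "neutral": {"*": {(0, 0)}},
}
```

On the input `(()(()))`, for instance, the item `(()` is pushed on the stack when the third push symbol arrives, and its balanced suffix `()` is compressed into a relation as soon as the current unfinished peak reaches weight 2.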

3.2 Algorithm Analysis 

We now introduce the quantity Depth(v) for each factor v constructed in Algorithm 2. It quantifies the
number of processed nested peaks in v as follows:

Definition 3.2. For each factor constructed in Algorithm 2, Depth is defined dynamically by Depth(a) = 0
when a ∈ Σ, Depth(v) = max_i Depth(v(i)) and Depth(R_v) = Depth(v) + 1.

In order to bound the size of the stack, Algorithm 2 considers the maximal balanced suffix v_2 of the
topmost element v_1·v_2 of the stack and, whenever |u_0| > |v_2|/2, it computes the relation R_{v_2} and continues
with a bigger current peak starting with v_1 (see lines 18 to 20 and Figure 1c). A consequence of this
compression is that the elements in the stack have geometrically decreasing weight and therefore the height
of the stack used by Algorithm 2 is logarithmic in the length of the input stream. This can be proved by a
direct inspection of Algorithm 2.

Proposition 3.3. Algorithm 2 accepts exactly when u ∈ L, while maintaining a stack of at most log |u|
items.

We state that Algorithm 2, when processing an input u of length n, considers at most O(log n) nested
peaks, that is Depth(v) = O(log n) for all factors v constructed in Algorithm 2.

Figure 1: Illustration of Algorithm 2. (a) Illustration of lines 10 to 11. (b) Illustration of lines 13 to 17.
(c) Illustration of lines 18 to 20.

Lemma 3.4. Let v be the factor used to compute R_v at line either 14 or 20 of Algorithm 2. Then
|v(i)| ≤ 2|v|/3, for all i. Moreover, for any factor w constructed by Algorithm 2 it holds that Depth(w) =
O(log |w|).

Proof. One only has to consider letters in Σ_Q. Hence, let R_w belong to v for some w: either w was
simplified into R_w at line 14 or at line 20 of Algorithm 2.

Let us first assume that it was done at line 20. Therefore, there is some v' ∈ Prefix(Λ_Q) to the right
of w with total weight greater than |w|/2 = |R_w|/2. This factor v' is entirely contained within v: indeed,
when R_w is computed v includes v'. Therefore |R_w| ≤ 2|v|/3.

If R_w comes from line 14 then w = u_0 and this u_0 is balanced and compressed. We claim that at the
previous round the test in line 19 failed, that is |u_0| − 1 ≤ |v_2|/2 where v_2 is the maximal balanced suffix
of top(Stack). Indeed, when performing the sequence of actions following a positive test in line 19, the
number of unmatched push symbols in the new u_0 is augmented at least by 1 from the previous u_0: hence, it
cannot be equal to 1 as the elements in the stack have pending call symbols and therefore in the next round
u_0 cannot be balanced. Therefore one has |u_0| − 1 ≤ |v_2|/2. Now when R_w = R_{u_0} is created, it is contained
in a factor that also contains v_2 and at least one pending call before v_2. Hence, |R_w| ≤ 2|v|/3.

Finally, the fact that for any factor w constructed by Algorithm 2, Depth(w) = O(log |w|) derives from
the fact that if Depth(w) = k, then |w| ≥ (3/2)^k. This can in turn be shown by induction on the depth.
Obviously any factor will have weight at least 1. Let us assume all factors of depth k have weight at least
(3/2)^k, and let w(i) be a letter such that Depth(w(i)) = k + 1. By definition, w(i) = R_v for some factor v
with Depth(v) = k. This means v contains at least one letter v(j) of depth k. By our induction hypothesis,
|v(j)| ≥ (3/2)^k, and therefore |w(i)| = |v| ≥ (3/2)|v(j)| ≥ (3/2)^{k+1}. □
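
The last step of the proof can be restated compactly as follows. This is only a rewriting of the induction above in LaTeX notation, with a numerical instance added for illustration; the input length 10^6 is an arbitrary choice.

```latex
\[
\mathrm{Depth}(w) = k
\;\Longrightarrow\;
|w| \ge \left(\tfrac{3}{2}\right)^{k}
\;\Longrightarrow\;
\mathrm{Depth}(w) \le \log_{3/2} |w| = O(\log |w|).
\]
% Numerical instance: for |w| = 10^6,
\[
\log_{3/2} 10^{6} = \frac{6 \ln 10}{\ln (3/2)} \approx \frac{13.816}{0.4055} \approx 34.07,
\qquad\text{so } \mathrm{Depth}(w) \le 34 .
\]
```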

4 The Special Case Of Peaks 

We now consider restricted instances consisting of a single peak. For these instances, Algorithm 2 never
uses its stack but u_0 can be of linear size. We show how to replace u_0 by a small random sketch in order
to get a streaming property tester using polylogarithmic memory. In Section 5, this notion of sketch will be
extended to obtain our final streaming property tester for general instances.

4.1 Hard Peak Instances 

Peaks are already hard for both streaming algorithms and property testing algorithms. Indeed, consider the
language Disj ⊆ Λ over the alphabet Σ = {0, 1, 0̄, 1̄, a}, defined as the union of all languages
a*·x(1)·a*· ... ·x(j)·a*·ȳ(j)·a*· ... ·ȳ(1)·a*, where j ≥ 1, x, y ∈ {0, 1}^j, and x(i)y(i) ≠ 1 for all i.

Then Disj can be recognized by a VPA with 3 states, Σ_+ = {0, 1}, Σ_− = {0̄, 1̄} and Σ_= = {a}. However,
the following fact states its hardness for both models. The hardness for non-approximation streaming
algorithms comes from a standard reduction to Set-Disjointness. The hardness for property testing algorithms
is a corollary of a similar result due to [26] for parenthesis languages with two types of parentheses.
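
As an illustration of the language Disj, the following Python sketch checks membership offline. It relies on assumptions made for this example: the barred pop letters are encoded as the strings "0'" and "1'", the input is a list of such tokens, and j ≥ 1 is required; the function name `in_disj` is hypothetical.

```python
def in_disj(word):
    """Offline membership test for Disj on a list of tokens.

    Tokens: "0", "1" (push), "0'", "1'" (pop, standing for barred letters),
    and "a" (neutral).
    """
    letters = [c for c in word if c != "a"]       # neutral letters are ignored
    j = 0
    while j < len(letters) and letters[j] in ("0", "1"):
        j += 1
    x, pops = letters[:j], letters[j:]
    if j == 0 or len(pops) != j or any(c not in ("0'", "1'") for c in pops):
        return False
    # Pops arrive as y(j), ..., y(1): reverse them to align y(i) with x(i).
    y = [c[0] for c in reversed(pops)]
    return all(not (xi == "1" and yi == "1") for xi, yi in zip(x, y))
```

This check stores all of x before reading y, which is in line with the linear memory lower bound for exact streaming algorithms stated in Fact 4.1.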

Fact 4.1. Any randomized p-pass streaming algorithm for Disj requires memory space Ω(n/p), where n
is the input length. Moreover, any (non-streaming) {2~^)-tester for Disj requires to query log n) 

letters of the input word. 

Proof. The Set-Disjointness problem is defined as follows. Two players hold respectively a subset A and
a subset B of {1, ..., n} and they must output whether A ∩ B = ∅. The communication complexity of this
problem is well known to be Ω(n). Therefore, using the standard reduction of streaming algorithms to
communication protocols, any randomized p-pass algorithm for Disj will require memory space Ω(n/p).

To prove the hardness of testing Disj in the query model, we use a result from [26] (Theorem 2) which
gives a polynomial lower bound on the number of queries of any Hamming-distance property tester in the
query model for PAR_2 ∩ Λ, the language over the alphabet {(, [, ], ), *} consisting of well-parenthesized
words that are also in Λ.
We first note that, because of the way PAR_2 ∩ Λ is constructed, the Hamming distance and the edit distance
of any word in Λ to PAR_2 ∩ Λ are within a constant factor of one another. Indeed, if a sequence of insertions
brings some word u inside PAR_2 ∩ Λ, then the deletions of the parentheses matching the insertions would do
the same. And all deletions can similarly be replaced by a substitution of the character being deleted with *.

It is also easy to reduce that language to Disj: we replace ( by 01, ) by 0̄1̄, [ by 10, and ] by 1̄0̄. □

Figure 2: Slicing of a word u ∈ Λ and evolution of the stack height for u.


Surprisingly, for every ε > 0, we will show that languages of the form L ∩ Λ, where L is a VPL, become
easy to ε-test by streaming algorithms. This is mainly because, given their full access to the input, streaming
algorithms can perform an input sampling which makes the property testing task easy, using only a single
pass and little memory.

4.2 Slicing Automaton 

Observe that Algorithm 2 will never use the stack in the case of a single peak. After Algorithm 2 has
processed the $i$-th letter of the data stream, $u_0$ contains $u[1,i]$. We will show how to compute $R_{u_0}$ at line 14
using a standard finite state automaton without any stack.

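Indeed, for every VPL $L$, one can construct a regular language $\hat{L}$ such that testing whether $u \in L \cap \Lambda$
is equivalent to testing whether some other word $\hat{u}$ belongs to $\hat{L}$. For this, let $I$ be a special symbol not in
$\Sigma_=$ encoding the relation set $\{(p,p) : p \in Q\}$. For a word $v \in \Sigma_=^l$, write $[v, I]$ for the word
$(v(1), I) \cdot (v(2), I) \cdots (v(l), I)$, and similarly $[I, v]$. Consider a weighted word of the form

$$u = \Big(\prod_{i=1}^{j} v_i \cdot a_i\Big) \cdot v_{j+1} \cdot \Big(\prod_{i=j}^{1} b_i \cdot w_i\Big),$$

where $a_i \in \Sigma_+$, $b_i \in \Sigma_-$, and $v_i, w_i \in \Sigma_=^*$. Then the slicing of $u$ (see Figure 2) is the word $\hat{u}$
over the alphabet $\hat{\Sigma} = (\Sigma_+ \times \Sigma_-) \cup (\Sigma_= \times \{I\}) \cup (\{I\} \times \Sigma_=)$ defined by

$$\hat{u} = \Big(\prod_{i=1}^{j} [v_i, I] \cdot [I, w_i] \cdot (a_i, b_i)\Big) \cdot [v_{j+1}, I].$$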
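The slicing of a single peak can be illustrated by a short Python sketch. It is only an illustration under stated assumptions: letters are given as (symbol, kind) pairs with kind in "push", "pop", "neutral", the string `I` stands for the special symbol $I$, and the function name `slice_peak` is hypothetical (it does not appear in the paper).

```python
from typing import List, Tuple

Letter = Tuple[str, str]   # (symbol, kind), kind in {"push", "pop", "neutral"}
I = "I"                    # stand-in for the special symbol I (assumed not an input letter)

def slice_peak(u: List[Letter]) -> List[Tuple[str, str]]:
    """Return the slicing of a single-peak word u as a list of letter pairs."""
    pushes = [i for i, (_, kind) in enumerate(u) if kind == "push"]
    pops = [i for i, (_, kind) in enumerate(u) if kind == "pop"]
    # A single peak has as many pushes as pops, and every push precedes every pop.
    if len(pushes) != len(pops) or (pushes and pushes[-1] > pops[0]):
        raise ValueError("input is not a single balanced peak")
    out: List[Tuple[str, str]] = []
    left, right = 0, len(u) - 1
    # Pair the i-th push a_i with the i-th pop counted from the end, b_i.
    for a_pos, b_pos in zip(pushes, reversed(pops)):
        out += [(u[p][0], I) for p in range(left, a_pos)]           # [v_i, I]
        out += [(I, u[p][0]) for p in range(b_pos + 1, right + 1)]  # [I, w_i]
        out.append((u[a_pos][0], u[b_pos][0]))                      # (a_i, b_i)
        left, right = a_pos + 1, b_pos - 1
    out += [(u[p][0], I) for p in range(left, right + 1)]           # [v_{j+1}, I]
    return out
```

For instance, the word x ( y [ z ] w ) q is sliced into (x,I) (I,q) ((,)) (y,I) (I,w) ([,]) (z,I): each slice pairs what is read on the left of the peak with what is read on its right.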


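Definition 4.2. Let $\mathcal{A} = (Q, \Sigma, \Gamma, Q_{in}, Q_f, \Delta)$ be a VPA. The slicing of $\mathcal{A}$ is the finite automaton
$\hat{\mathcal{A}} = (\hat{Q}, \hat{\Sigma}, \hat{Q}_{in}, \hat{Q}_f, \hat{\Delta})$ where $\hat{Q} = Q \times Q$, $\hat{Q}_{in} = Q_{in} \times Q_f$, $\hat{Q}_f = \{(p,p) : p \in Q\}$, and the
transitions $\hat{\Delta}$ are:

1. $(p, q) \xrightarrow{(a,b)} (p', q')$ when $p \xrightarrow{a} (p', \mathrm{push}(\gamma))$ and $(q', \mathrm{pop}(\gamma)) \xrightarrow{b} q$ are both transitions of $\mathcal{A}$.

2. $(p, q) \xrightarrow{(c,I)} (p', q)$, resp. $(p, q) \xrightarrow{(I,c)} (p, q')$, when $p \xrightarrow{c} p'$, resp. $q' \xrightarrow{c} q$, is a transition of $\mathcal{A}$.

This construction will be later used in Section 5 for weighted languages. In that case, we define the
weight of a letter in $\hat{u}$ by $|(a, b)| = |a| + |b|$, with the convention that $|I| = 0$. Moreover, we write $\hat{\Sigma}_Q$ for
the alphabet obtained similarly to $\hat{\Sigma}$ using $\Sigma_Q$ instead of $\Sigma_=$. Note that the slicing automaton $\hat{\mathcal{A}}$ defined on
$\hat{\Sigma}_Q$ is $\hat{\Sigma}$-closed and has $\hat{\Sigma}$-diameter at most $2m^2$.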

Lemma 4.3. If $\mathcal{A}$ is a VPA accepting $L$, then $\hat{\mathcal{A}}$ is a finite automaton accepting $\hat{L} = \{\hat{u} : u \in L \cap \Lambda\}$.

Proof. Because transitions on push symbols do not depend on the top of the stack, transitions in $\hat{\Delta}$ correspond
to slices that are valid for $\mathcal{A}$ (see Figure 2). Finally, $\hat{Q}_{in}$ ensures that a run for $L$ must start in $Q_{in}$ and
end in $Q_f$, and $\hat{Q}_f$ that a state at the top of the peak is consistent from both sides. □

Proposition 4.4. Let $v \in \Lambda$ be s.t. $(p, q) \xrightarrow{\hat{v}} (p', q')$. There is $w \in \Lambda$ s.t. $|w| \le 2m^2$ and $(p, q) \xrightarrow{\hat{w}} (p', q')$.
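As an illustration of Definition 4.2 and Lemma 4.3, the following Python sketch simulates the slicing automaton on an already sliced word by tracking the reachable pairs of states. The dictionary encoding of the VPA, the sentinel `I`, the function names and the toy automaton are all assumptions made for this example, not the paper's notation.

```python
from typing import Dict, Iterable, Set, Tuple

I = "<I>"  # stand-in for the special symbol I (assumed not to be an input letter)
Pair = Tuple[int, int]

def slicing_step(vpa: Dict, pairs: Set[Pair], letter: Tuple[str, str]) -> Set[Pair]:
    """One transition of the slicing automaton on a letter of the sliced alphabet."""
    x, y = letter
    nxt: Set[Pair] = set()
    for (p, q) in pairs:
        if y == I:      # (c, I): neutral letter read on the left side, p -c-> p'
            nxt |= {(p2, q) for (p1, c, p2) in vpa["neutral"] if p1 == p and c == x}
        elif x == I:    # (I, c): neutral letter read on the right side, q' -c-> q
            nxt |= {(p, q1) for (q1, c, q2) in vpa["neutral"] if q2 == q and c == y}
        else:           # (a, b): a push on the left matched with a pop on the right
            for (p1, a, p2, gamma) in vpa["push"]:
                if p1 == p and a == x:
                    nxt |= {(p2, q1) for (q1, g, b, q2) in vpa["pop"]
                            if g == gamma and b == y and q2 == q}
    return nxt

def slicing_accepts(vpa: Dict, sliced: Iterable[Tuple[str, str]]) -> bool:
    """Start in Q_in x Q_f and accept if some reached pair is of the form (p, p)."""
    pairs = {(p, q) for p in vpa["initial"] for q in vpa["final"]}
    for letter in sliced:
        pairs = slicing_step(vpa, pairs, letter)
    return any(p == q for (p, q) in pairs)

# Toy VPA (hypothetical): state 0 before the first neutral x, state 1 afterwards.
toy = {
    "initial": {0}, "final": {1},
    "push": {(0, "(", 0, "A")},        # p -a-> (p', push(gamma))
    "pop": {(1, "A", ")", 1)},         # (q', pop(gamma)) -b-> q
    "neutral": {(0, "x", 1), (1, "x", 1)},
}
```

On this toy automaton, the slicing of ( x ) is accepted, while the slicing of ( ) is rejected because the two sides of the peak do not meet in a common state.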

4.3 Random Sketches 

We are now ready to build a tester for $L \cap \Lambda$. To test a word $u$ we use a property tester for the regular
language $\hat{L}$. Regular languages are known to be $\varepsilon$-testable for the Hamming distance with $O((\log 1/\varepsilon)/\varepsilon)$
non-adaptive queries on the input word [1], that is, queries that can all be made simultaneously. Those
queries define a small random sketch of $u_0$ that can be sent to the tester for approximating $R_{u_0}$. Since the
Hamming distance is larger than the edit distance, those testers are also valid for the latter distance. Observe
also that, for $u, v \in \Lambda_Q$, we have $\mathrm{bdist}(u, v) \le 2\,\mathrm{dist}(\hat{u}, \hat{v})$. The only remaining difficulty is to provide to
the tester an appropriate sampling on $\hat{u}$ while processing $u$.

We will proceed similarly for the general case in Section 5, but then we will have to consider weighted
words. Therefore we show how to sketch $u_0$ in that general case already. Indeed, the tester of [1] was
simplified for the edit distance in flM, and later on adapted for weighted words in [25]. We consider here an
alternative approach that we believe simpler, but slightly less efficient than the tester of [25]. In particular, we
introduce in Appendix A a new criterion, $\kappa$-saturation, that significantly simplifies the correctness
proof of the tester compared to the ones in [1] and in [25].

Our tester for weighted regular languages is based on $k$-factor sampling on $\hat{u}$ that we will simulate by
an over-sampling built from a letter sampling on $u$, that is, according to the weights of the letters of $u$ only.
This new sampling can be easily performed given a stream of $u$ using a standard reservoir sampling.

Definition 4.5. For a weighted word $u \in \Lambda_Q$, denote by $\mathcal{W}_k(u)$ the sampling over subwords of $u$ constructed
as follows (see Figure 3):

(1) Sample a factor $u[i, i + k]$ of $u$ with probability $|u(i)|/|u|$.

(2) If $u(i)$ is in the push sequence of $u$, let $u[j, j']$ be the matching pop sequence of $u[i, i + k]$, including the
first $k$ neutral symbols after the last pop symbol, if any. Add $u[j' - 2k, j']$ to the sample.¹

Fact 4.6. There is a randomized streaming algorithm with memory $O(k + \log n)$ which, given $k$ and $u$ as
input, samples $\mathcal{W}_k(u)$.

Proof. (1) can easily be obtained using reservoir sampling. If the sampling enters the pop sequence while the
current candidate is part of the push sequence, then (2) can be done for that candidate, and forgotten if the
sampling eventually picks another one. That eventual candidate will not be part of the push sequence, so we
are done. □
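Step (1) of this proof can be sketched with a weighted variant of reservoir sampling. The Python code below is a minimal sketch under stated assumptions: the stream yields (letter, weight) pairs with non-negative weights, the function name `sample_k_factor` is hypothetical, and step (2) (collecting the matching pop sequence) is deliberately left out.

```python
import random
from typing import Iterable, List, Optional, Tuple

def sample_k_factor(stream: Iterable[Tuple[str, float]], k: int,
                    rng: random.Random) -> Tuple[Optional[int], List[str]]:
    """Return (i, u[i, i+k]) where position i is chosen with probability |u(i)|/|u|.

    Only the current candidate factor (at most k+1 letters) and the running
    total weight are stored.
    """
    total = 0.0
    start: Optional[int] = None
    factor: List[str] = []
    for pos, (letter, weight) in enumerate(stream):
        total += weight
        # The new letter replaces the candidate with probability weight/total,
        # which makes each position i the final candidate with probability |u(i)|/|u|.
        if weight > 0 and rng.random() < weight / total:
            start, factor = pos, [letter]
        elif start is not None and len(factor) < k + 1:
            factor.append(letter)   # keep extending the current candidate factor
    return start, factor
```

Storing the position additionally costs $O(\log n)$ bits, which matches the $O(k + \log n)$ bound of Fact 4.6 for this step.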

Lemma 4.7. Let $u$ be a weighted word, and let $k$ be such that $4k \le |u|$. Then $4k$ independent copies of
$\mathcal{W}_k(u)$ over-sample the $k$-factor sampling on $\hat{u}$.

¹ Some matching pops of $u[i, i + k]$ may be ignored.

Figure 3: The sampling $\mathcal{W}_k(u)$ from Definition 4.5. The sample is in red; dotted parts are for omitted neutral
symbols.


Proof. Denote by $\hat{\mathcal{W}}$ the $k$-factor sampling on $\hat{u}$, and by $\mathcal{W}$ some $4k$ independent copies of $\mathcal{W}_k(u)$. For any
$k$-factor $v$ of $\hat{u}$, we will show that the probability that $v$ is sampled by $\hat{\mathcal{W}}$ is at most the probability that $v$ is
a factor of an element sampled by $\mathcal{W}$. For that, we distinguish the following three cases:

• $v$ contains only letters in $\{I\} \times \Sigma_Q$. Then the probability that $v$ is sampled by $\hat{\mathcal{W}}$ is equal to the
probability that it is sampled by $\mathcal{W}_k(u)$ in step (1).

• $v$ starts by a letter $(a, b)$ in $\Sigma_+ \times \Sigma_-$ or by a letter in $\Sigma_Q \times \{I\}$. Then the probability that the
$u(i)$ selected by $\mathcal{W}_k(u)$ is $a$ is at least half of the probability that $\hat{\mathcal{W}}$ samples $v$, as a (push, pop)
pair in $\hat{u}$ has weight 2 while a push has weight 1 in $u$. Because $v$ is a $k$-factor, it is contained in
$(u[i, i + k], u[j' - 2k, j'])$. Hence, the probability that $v$ is sampled by $\hat{\mathcal{W}}$ is at most the probability
that $v$ is a factor of an element sampled by $\mathcal{W}$ in step (2).

• $v$ starts by a letter in $\{I\} \times \Sigma_Q$ but also contains letters outside of this set. Since $|\hat{u}| \ge |u|/2$, we get
$\Pr(\mathcal{W}_k(u) \text{ samples } v) \ge 1/|u|$ and $\Pr(\hat{\mathcal{W}} \text{ samples } v) \le k/|\hat{u}| \le 2k/|u|$.
Thus the probability that one of the $4k$ samples of $\mathcal{W}$ has the factor $v$ is at least $1 - (1 - 1/|u|)^{4k}$.
As $1 - (1 - 1/|u|)^{4k} \ge 1 - \frac{1}{1 + 4k/|u|} = \frac{4k}{|u| + 4k} \ge 2k/|u|$ when $|u| \ge 4k$, we conclude again that the
probability that $v$ is sampled by $\hat{\mathcal{W}}$ is at most the probability that $v$ is a factor of an element sampled
by $\mathcal{W}$ in step (2).

□

We can now give an analogue of the property tester for weighted regular languages in $L \cap \Lambda_Q$. For that,
we use the following notion of approximation.

Definition 4.8. Let $R \subseteq Q^2$. Then $R$ $(\varepsilon, \Sigma)$-approximates a balanced word $u \in (\Sigma_+ \cup \Sigma_- \cup \Sigma_Q)^*$ on $\mathcal{A}$, if
for all $p, q \in Q$: (1) $(p, q) \in R$ when $p \xrightarrow{u} q$; (2) $u$ is $(\varepsilon, \Sigma)$-close to some word $v$ satisfying $p \xrightarrow{v} q$ when
$(p, q) \in R$.

Our tester is going to be robust enough to handle samples that do not exactly match the peaks
we want to compress.

Theorem 4.9. Let $\mathcal{A}$ be a VPA with $m \ge 2$ states and $\Sigma$-diameter $d \ge 2$. Let $\varepsilon > 0$, $\eta > 0$,
$t = 2\lceil 4dm^3(\log 1/\eta)/\varepsilon \rceil$, $k = \lceil 4dm/\varepsilon \rceil$ and $T = 4kt$. There is an algorithm that, given $T$ random subwords
$z_1, \ldots, z_T$ of some weighted word $v \in \Lambda_Q$, such that each $z_i$ comes from an independent sampling $\mathcal{W}_k(v)$,
outputs a set $R \subseteq Q \times Q$ that $(\varepsilon, \Sigma)$-approximates $v$ on $\mathcal{A}$ with bounded error $\eta$.

Let $v'$ be obtained from $v$ by at most $\varepsilon|v|$ balanced deletions. Then, the conclusion is still true if the algorithm
is given an independent $\mathcal{W}_k(v')$ for each $z_i$ instead, except that $R$ now provides a $(3\varepsilon, \Sigma)$-approximation.
Last, each sampling can be replaced by an over-sampling.

Proof. The argument uses as a subroutine the algorithm of Theorem A.2 for $\hat{\mathcal{A}}$, where $\hat{\mathcal{A}}$ has been extended
to $\hat{\Sigma}_Q$. Recall that $\hat{\mathcal{A}}$ is $\hat{\Sigma}$-closed and its $\hat{\Sigma}$-diameter is also the $\Sigma$-diameter of $\mathcal{A}$. Also observe that
$\mathrm{bdist}_\Sigma(u, v) \le 2\,\mathrm{dist}_{\hat{\Sigma}}(\hat{u}, \hat{v})$.

By Lemma 4.7, the $T$ independent samplings $\mathcal{W}_k(v)$ provide us the sampling we need for Theorem A.2.
For the case where we do not have an exact $k$-factor sampling on $v$ however, we need to compensate for
the prefix of $v$ of size $\varepsilon|v|$ that may not be included in the sampling. This introduces potentially an additional
error of weight $2\varepsilon|v|$ on the approximation $R$. □

As a consequence we get our first streaming tester for $L \cap \Lambda$.

Theorem 4.10. Let $\mathcal{A}$ be a VPA for $L$ with $m \ge 2$ states, and let $\varepsilon, \eta > 0$. Then there is a streaming $\varepsilon$-tester
for $L \cap \Lambda$ with one-sided error $\eta$ and memory space $O((m^8 \log(1/\eta)/\varepsilon^2)(m^3/\varepsilon + \log n))$, where $n$ is the
input length.

Proof. We use Algorithm 2 where we replace the current factor $u_0$ by $T = 4kt$ independent samplings
$\mathcal{W}_k(u_0)$. We know that such samplings can be computed using memory space $O(k + \log n)$ by Fact 4.6.
By Proposition 4.4, the slicing automaton has $\hat{\Sigma}$-diameter $d$ at most $2m^2$. Therefore, from Theorem 4.9,
taking $t = 4\lceil 4dm^3(\log 1/\eta)/\varepsilon \rceil$ and $k = \lceil 4dm/\varepsilon \rceil$ leads to the desired conclusion. □

5 Algorithm With Sketching 

5.1 Sketching Using Suffix Samplings 

We now describe the sketches used by our main algorithm. They are based on the generalization of the
random sketches described in Section 4.3. Moreover, they rely on a notion of suffix samplings, that ensures
a good letter sampling on each suffix of a data stream. Recall that the letter sampling on a weighted word $u$
samples a random letter $u(i)$ (with its position) with probability $|u(i)|/|u|$.

Definition 5.1. Let $u$ be a weighted word and let $\alpha > 1$. An $\alpha$-suffix decomposition of $u$ of size $s$ (see
Figure 4) is a sequence of suffixes $(u^l)_{1 \le l \le s}$ of $u$ such that: $u^1 = u$, $u^s$ is the last letter of $u$, and for all $l$,
$u^{l+1}$ is a strict suffix of $u^l$ and if $|u^l| > \alpha|u^{l+1}|$ then $u^l = a \cdot u^{l+1}$ where $a$ is a single letter.

An $(\alpha, t)$-suffix sampling on $u$ of size $s$ is an $\alpha$-suffix decomposition of $u$ of size $s$ with $t$ letter samplings
on each suffix of the decomposition.

An $(\alpha, t)$-suffix sampling can be either concatenated to another one, or compressed as stated below.

Proposition 5.2. Given as input an $(\alpha, t)$-suffix sampling $D_u$ on $u$ of size $s_u$ and another one $D_v$ on $v$ of
size $s_v$, there is an algorithm Concatenate($D_u$, $D_v$) computing an $(\alpha, t)$-suffix sampling on the concatenated
word $u \cdot v$ of size at most $s_u + s_v$ in time $O(s_u)$.

Moreover, given as input an $(\alpha, t)$-suffix sampling $D_u$ on $u$ of size $s_u$, there is also an algorithm
Simplify($D_u$) computing an $(\alpha, t)$-suffix sampling on $u$ of size at most $2\lceil \log |u| / \log \alpha \rceil$ in time $O(s_u)$.

Proof. We sketch those procedures. They are fully described in Algorithm 3. For Concatenate, it suffices
to do the following. For each suffix $u^l$ of $D_u$: (1) replace $u^l$ by $u^l \cdot v$, and (2) replace the $i$-th sampling of
$u^l$ by the $i$-th sampling of $v$ with probability $|v|/(|u^l| + |v|)$, for $i = 1, \ldots, t$.

For Simplify, do the following. For each suffix $u^l$ of $D_u$, from $l = s_u$ (the smallest one) to $l = 1$ (the
largest one): (1) replace all suffixes $u^{l-1}, \ldots, u^m$ by the largest suffix $u^m$ such that $|u^m| \le \alpha|u^l|$; and
(2) suppress all samples from deleted suffixes. □

Figure 4: An $\alpha$-suffix decomposition of $u$ of size $s$. For every $l$, either $|u^l| \le \alpha|u^{l+1}|$ or $u^l = a \cdot u^{l+1}$ where
$a$ is a letter.


Using this proposition, one can easily design a streaming algorithm constructing online a suffix decomposition
of polylogarithmic size. Starting with an empty suffix sampling $S$, simply concatenate $S$ with
the next processed letter $a$ of the stream, and then simplify it. We formalize this, together with functions
Concatenate and Simplify, in Algorithm 3.

Lemma 5.3. Given a weighted word $u$ as a data stream and a parameter $\alpha > 1$, Online-Suffix-Sampling
in Algorithm 3 constructs an $\alpha$-suffix sampling on $u$ of size at most $1 + 2\lceil \log |u| / \log \alpha \rceil$.

One can then slightly modify Algorithm 3 so that within each suffix of the decomposition it simulates $t$
letter samplings in order to construct an $(\alpha, t)$-suffix sampling.
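The online construction can be sketched in Python as follows. This is a simplified illustration, not the paper's exact procedure: it keeps a single sample per suffix ($t = 1$), represents each suffix only by its weight and its sample, assumes strictly positive letter weights, and uses hypothetical function names. The list is ordered from the largest suffix (index 0) to the smallest one (last index).

```python
import math
import random
from typing import Iterable, List, Tuple

def concatenate_letter(D: List[list], letter, weight: float, rng: random.Random) -> List[list]:
    """Append one letter: every stored suffix grows, and a new one-letter suffix is added."""
    for item in D:                       # item = [weight of the suffix, its sample]
        if rng.random() < weight / (item[0] + weight):
            item[1] = letter             # the new letter becomes the sample of this suffix
        item[0] += weight
    D.append([weight, letter])
    return D

def simplify(D: List[list], alpha: float) -> List[list]:
    """Scan from the smallest suffix upwards, skipping suffixes made redundant by alpha."""
    kept = []
    l = len(D) - 1
    while l >= 0:
        kept.append(D[l])
        sigma = D[l][0]
        m = l - 1
        # Move to the largest suffix whose weight is still at most alpha * sigma.
        while m - 1 >= 0 and D[m - 1][0] <= alpha * sigma:
            m -= 1
        l = m
    kept.reverse()
    return kept

def online_suffix_sampling(stream: Iterable[Tuple[object, float]], alpha: float,
                           rng: random.Random) -> List[list]:
    D: List[list] = []
    for letter, weight in stream:
        D = simplify(concatenate_letter(D, letter, weight, rng), alpha)
    return D
```

With unit weights, the number of stored suffixes stays within the bound $1 + 2\lceil \log |u| / \log \alpha \rceil$ of Lemma 5.3, since after each call to `simplify` the weight grows by a factor larger than $\alpha$ every two suffixes.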

5.2 The Algorithm 

Our final algorithm is a modification of Algorithm 2; in particular it will approximate relations $R_v$ (in the
spirit of Definition 4.8), instead of exactly computing them. Therefore, it may fail at various steps and
produce relations that do not correspond to any word. But still, it will produce relations $R$ such that for any
$(p, q) \in R$, there is a balanced word $u \in \Sigma^*$ with $p \xrightarrow{u} q$, that is $R \in \Sigma_Q$.

To mimic Algorithm 2 we need to encode (compactly) each unfinished peak $v$ of the stack and $u_0$:
for that we use the data structure described in Algorithm 4. Our final algorithm, Algorithm 5, is simply
Algorithm 2 with this new data structure and corresponding adapted operations, where $\varepsilon' = \varepsilon/(6 \log n)$.

We now detail the methods, where we implicitly assume that each letter processed by the algorithm
comes with its respective height and (exact or approximate) weight. They use functions Concatenate and
Simplify described in Proposition 5.2 (and in detail in Algorithm 3), while adapting them.

In the next section, we show that the samplings $S_{v^l}$ are close enough to a $(1 + \varepsilon')$-suffix sampling on
$v$. This lets us build an over-sampling of a $(1 + \varepsilon')$-suffix sampling. We also show that it only requires a
polylogarithmic number of samples. Then, we explain how to recursively apply the tester from Theorem 4.9
(with $\varepsilon'$) in order to obtain the compressions at lines 14 and 20 while keeping a cumulative error below $\varepsilon$. We
now state our main result, whose proof relies on Lemmas 5.6 and 5.8.

Theorem 5.4. Let $\mathcal{A}$ be a VPA for $L$ with $m \ge 2$ states, and let $\varepsilon, \eta > 0$. Then there is an $\varepsilon$-streaming
algorithm for $L$ with one-sided error $\eta$ and memory space $O(m^5 2^{3m^2} (\log^6 n)(\log 1/\eta)/\varepsilon^4)$, where $n$ is the
input length.

Proof. We use Algorithm 5, which uses the tester from Theorem 4.9 for the compressions at lines 14 and 20
of Algorithm 2. We know from Lemma 5.8 and Lemma 4.7 that it is enough to choose $\varepsilon' = \varepsilon/(6 \log n)$,
$\eta' = \eta/n$, and Fact 5.5 gives us $d = 2^{m^2}$. Therefore we need $T = 2304\, m^4 2^{2m^2} (\log^2 n)(\log 1/\eta)/\varepsilon^2$
independent $k$-factor samplings of $u$ augmented by one, with $k = 24\, m\, 2^{m^2} (\log n)/\varepsilon$. Lemma 5.6 tells us

Algorithm 3: $\alpha$-Suffix Sampling

Data structure:
    // $D$, $D_u$, $D_v$, $D_{temp}$: stacks of items $(\sigma, b)$, one for each suffix
    // of the decomposition, where $\sigma$ encodes the weight and $b$ the $t$ samples
Code:
    Concatenate($D_u$, $D_v$)
        $D \leftarrow D_u$
        $(c_1, \ldots, c_t) \leftarrow$ all $t$ samples on $v$ (the largest suffix in $D_v$)
        For each $(\sigma, b) \in D$ where $b = (b_1, \ldots, b_t)$
            Replace each $b_i$ by $c_i$ with probability $|v|/(|v| + \sigma)$
            Replace $(\sigma, b)$ by $(\sigma + |v|, b)$
        Append $D_v$ to the top of $D$
        Return $D$
    Simplify($D_u$)
        $D \leftarrow D_u$
        For each $(\sigma, b) \in D$ from top to bottom
            $D_{temp} \leftarrow$ elements $(\tau, c) \in D$ below $(\sigma, b)$ with $\tau \le \alpha\sigma$
            Replace $D_{temp}$ in $D$ by the bottommost element of $D_{temp}$
        Return $D$
    Online-Suffix-Sampling
        $D \leftarrow \emptyset$
        While $u$ not finished
            $a \leftarrow$ Next($u$)
            $D \leftarrow$ Concatenate($D$, $a$) where $a$ encodes the suffix sampling $(|a|, (a, \ldots, a))$
            $D \leftarrow$ Simplify($D$)
        Return $D$


Algorithm 4: Sketch for an unfinished peak

Parameters: real $\varepsilon' > 0$, integer $T \ge 1$
Data structure for a weighted word $v \in \mathrm{Prefix}(\Lambda_Q)$
    Weights of $v$ and of its first letter $v(1)$
    Height of $v(1)$
    Boolean indicating whether $v$ contains a pop symbol
    $(1 + \varepsilon')$-suffix decomposition $v^1, \ldots, v^s$ of $v$ encoded by
        Estimates $|v^l|_{low}$ and $|v^l|_{high}$ of $|v^l|$
        $T$ independent samplings $S_{v^l}$ on $v^l$   // see details below
            with corresponding weights and heights

Algorithm 5: Adaptation of Algorithm 2 using sketches

Run Algorithm 2 using data structure from Algorithm 4 and with the following adaptations:

Adaptation of functions from Proposition 5.2
    Concatenate($D_u$, $D_v$) with an exact estimate of $|v|$ is modified s.t.
        the replacement probability is now $|v|/(|u^l|_{high} + |v|)$
        and $|u^l \cdot v|_z = |u^l|_z + |v|$, for $z$ = low, high
    Simplify($D_u$) with $\alpha = 1 + \varepsilon'$ has now the relaxed condition $|u^m|_{high} \le (1 + \varepsilon')|u^l|_{low}$

Adaptation of operations on factors used in Algorithm 2
    Compute relation: $R_v$
        Run the algorithm of Theorem 4.9 using samples in $D_v$
    Decomposition: $v_1 \cdot v_2 \leftarrow v$
        Find largest suffix $v^l$ in $D_v$ s.t. $v^l \in \mathrm{Prefix}(\Lambda_Q)$   // i.e. s.t. $v^l$ is in $v_2$
        $D_{v_1} \leftarrow$ suffixes $(v^m)_{m < l}$ with their samples
        $D_{v_2} \leftarrow$ suffix $v^l$ with its samples and weight estimates:   // for computing $R_{v_2}$
            - $(|v^l|_{high}, |v^l|_{low})$ when $v^{l-1}$ and $v^l$ differ by exactly one letter (then $v^l = v_2$)
            - $(|v^{l-1}|_{high}, |v^l|_{low})$ otherwise
    Test: $|u_0| \ge |v_2|/2$ using $|v_2|_{low}$ instead of $|v_2|$
    Concatenation: $u_0 \leftarrow (v_1 \cdot R_{v_2}) \cdot u_0$
        $D_{v_1'} \leftarrow$ replacing each sample of $D_{v_1}$ in $v_2$ by $R_{v_2}$
        // The height of a sample determines whether it is in $v_2$
        $D_{u_0} \leftarrow$ Simplify(Concatenate($D_{v_1'}$, $D_{u_0}$))


that using twice as many samples from our algorithm, that is for each $S_{v^l}$, is enough in order to over-sample
them.

Because of the sampling variant we use, the size of each decomposition is at most $96(\log^2 n)/\varepsilon + O(\log n)$
by Lemma 5.6. The samplings in each element of the decomposition use memory space $k$, and
there are $2T$ of them. Furthermore, each element of the stack has its own sketch, and the stack is of
height at most $\log n$. Multiplying all those together gives us the upper bound on the memory space used by
Algorithm 5. □

5.3 Final Analysis 

As Algorithm 5 may fail at various steps, the relations it considers may not correspond to any word. However,
each relation $R$ that it produces is still in $\Sigma_Q$. Furthermore, the slicing automaton $\hat{\mathcal{A}}$ that we define over $\hat{\Sigma}_Q$
is $\hat{\Sigma}$-closed. Fact 5.5 below bounds the $\hat{\Sigma}$-diameter of $\hat{\mathcal{A}}$ (which is equal to the $\Sigma$-diameter of $\mathcal{A}$) by $2^{m^2}$.
Note that for simpler languages, such as those coming from a DTD, this bound can be lowered to $m$.

Fact 5.5. Let $\mathcal{A}$ be a VPA with $m$ states. Then the $\Sigma$-diameter of $\mathcal{A}$ is at most $2^{m^2}$.

Proof. A similar statement is well known for any context-free grammar given in Chomsky normal form.
Let $N$ be the number of non-terminal symbols used in the grammar. If the grammar produces one balanced
word from some non-terminal symbol, then it can also produce one whose length is at most $2^N$ from the
same non-terminal symbol. This is proved using a pumping argument on the derivation tree. We refer the
reader to the textbook [17].

Now, in the setting of visibly pushdown languages one needs to transform $\mathcal{A}$ into a context-free grammar
in Chomsky normal form. For that, consider first an intermediate grammar whose non-terminal symbols are
all the $X_{pq}$ where $p$ and $q$ are states from $\mathcal{A}$; such a non-terminal symbol will produce exactly those words
$u$ such that $p \xrightarrow{u} q$, hence our initial symbols will be those of the form $X_{q_0 q_f}$ where $q_0$ is an initial state and
$q_f$ is a final state. The rewriting rules are the following ones:

• $X_{pp} \to \varepsilon$

• $X_{pq} \to X_{pr} X_{rq}$ for any state $r$

• $X_{pq} \to a X_{p'q'} b$ whenever one has in the automaton $p \xrightarrow{a} (p', \mathrm{push}(\gamma))$ and $(q', \mathrm{pop}(\gamma)) \xrightarrow{b} q$ for some
push symbol $a$, pop symbol $b$ and stack letter $\gamma$.

• $X_{pq} \to a X_{p'q}$ whenever one has in the automaton $p \xrightarrow{a} p'$ for some neutral symbol $a$.

• $X_{pq} \to X_{pq'} a$ whenever one has in the automaton $q' \xrightarrow{a} q$ for some neutral symbol $a$.

Obviously, this grammar generates language $L(\mathcal{A})$.

As we are here interested only in the length of the balanced words produced by the grammar, we can
replace any terminal symbol by a dummy symbol $\sharp$. Now, once this is done we can put the grammar into
Chomsky normal form by using an extra non-terminal symbol (call it $X_\sharp$ as it is used to produce the $\sharp$
terminal). As we have $m^2 + 1$ non-terminals in the resulting grammar we are almost done. To get to the tight
bound announced in the statement, one simply removes the extra non-terminal symbol $X_\sharp$ and reasons on
the length of the derivation directly. □

We first show that the decomposition, weights and sampling we maintain are close enough to a $(1 + \varepsilon')$-suffix
sampling with the correct weights. Recall that $\varepsilon' = \varepsilon/(6 \log n)$.

Lemma 5.6 (Stability lemma). Let $v, \mathcal{W}$ be an unfinished peak with a sampling maintained by the algorithm.
Then $\mathcal{W}$ over-samples a $(1 + \varepsilon')$-suffix sampling on $v$, and $\mathcal{W}$ has size at most $144(\log |v|)(\log n)/\varepsilon + O(\log n)$.

Before proving the stability lemma, we first prove that Algorithm 5 maintains a structure that is not too
far from a $(1 + \varepsilon')$-suffix sampling.

Proposition 5.7. Let $v$ be an unfinished peak, and let $v^1, \ldots, v^s$ be the suffix decomposition maintained by
the algorithm. The following is true:

(1) $v^1, \ldots, v^s$ is a valid $(1 + \varepsilon')$-suffix decomposition of $v$.

(2) For each letter $a$ of every $v^l$, and for every sample $S_{v^l}$, $\Pr[S_{v^l} = a] \ge |a|/|v^l|_{high}$.

(3) Each $v^l$ satisfies $|v^l|_{high} - |v^l|_{low} \le 2\varepsilon'|v^l|_{low}/3$.

Proof. Property (1) is guaranteed by the (modified) Simplify function used in Algorithm 5, which preserves
even more suffixes than the original algorithm.

Properties (2) and (3) are proven by induction on the last letter read by Algorithm 5. Both are true when
no symbol has been read yet.

We start with property (2). Let us first consider the case where we use bullet-concatenation after the
last letter was read. Then for all $v^l$, the (modified) Concatenate function ensures $S_{v^l}$ becomes $a$ with
probability $1/|v^l|_{high}$. Otherwise, $S_{v^l}$ remains unchanged and by induction $S_{v^l} = b$ with probability at least
$(1 - 1/|v^l|_{high})\,|b|/(|v^l|_{high} - 1) = |b|/|v^l|_{high}$, for each other letter $b$ of $v^l$.

The other case is that some $R_{v_2}$ is computed at line 20 of Algorithm 2. In this case, $v$ is equal to
some $(v_1 \cdot R_{v_2}) \cdot u_0$ concatenation. For each suffix $(v_1 \cdot v_2)^l$ in $D_{v_1 \cdot v_2}$ containing $v_2$, we proceed in
the same way with the Concatenate function, replacing any sample in $v_2$ with $R_{v_2}$. Now consider $v_2^l$, the
largest suffix of $v_1 \cdot v_2$ contained in $v_2$, and $v^l = R_{v_2} \cdot u_0$. We use the fact that Concatenate looks at
$|v^l|_{high} \ge |u_0| + |R_{v_2}|$ for replacing samples. This means that we choose $R_{v_2}$ as a sample for $v^l$ with
probability $(|v^l|_{high} - |u_0|)/|v^l|_{high} \ge |R_{v_2}|/|v^l|_{high}$, and therefore the property is verified.

We now prove property (3). If $v^l$ has just been created, it contains only one letter of weight 1, and obviously
$|v^l|_{low} = |v^l|_{high} = |v^l|$. In addition, unless some $R_{v_2}$ has been computed at line 20 of Algorithm 2
when the last letter was read, then $|v^l|$ is only augmented by some exactly known $|a|$ or $|u_0|$ compared to
the previous step. Therefore the difference $|v^l|_{high} - |v^l|_{low}$ does not change, and by induction it remains
smaller than $2\varepsilon'|v^l|_{low}/3$, which can only increase. Now consider $R_{v_2}$ computed at line 20 and $v^l = R_{v_2} \cdot u_0$.
We again consider $v_2^l$ for the largest suffix in the decomposition of $v_1 \cdot v_2$ that is contained within $v_2$, as used
in Algorithm 5, and $v_2^{l-1}$ is the suffix immediately preceding $v_2^l$ in that decomposition.

If $|v_2^{l-1}|_{high} > (1 + \varepsilon')|v_2^l|_{low}$, then from the Simplify function, the difference between those two suffixes
cannot be more than one letter, and then $v_2^l = v_2$. Therefore, we have $|R_{v_2} \cdot u_0|_{high} = |v_2|_{high} + |u_0|$ and
$|R_{v_2} \cdot u_0|_{low} = |v_2|_{low} + |u_0|$. We conclude by induction on $|v_2|$.

We end with the case $|v_2^{l-1}|_{high} \le (1 + \varepsilon')|v_2^l|_{low}$. By definition, $|R_{v_2} \cdot u_0|_{high} = |v_2^{l-1}|_{high} + |u_0|$ and
$|R_{v_2} \cdot u_0|_{low} = |v_2^l|_{low} + |u_0|$. Therefore the difference $|v^l|_{high} - |v^l|_{low}$ is at most $\varepsilon'|v_2^l|_{low}$. Since the test at
line 19 of Algorithm 2 (modified by Algorithm 5) was satisfied, we know that $|v_2^l|_{low} \le 2|u_0|$, and finally
$|v^l|_{high} - |v^l|_{low} \le 2\varepsilon'(|v_2^l|_{low} + |u_0|)/3 \le 2\varepsilon'|v^l|_{low}/3$, which concludes the proof. □

We can now prove the stability lemma.

Proof of Lemma 5.6. The first property is a direct consequence of properties (1) and (2) in Proposition 5.7, as
in the proof of Lemma 4.7.

The second is a consequence of the (modified) Simplify used in Algorithm 5. $D_{temp}$ is defined as
the set of suffixes $v^m$ below $v^l$, with $m < l$, such that $|v^m|_{high} \le (1 + \varepsilon')|v^l|_{low}$. Because Simplify deletes
all but one element from $D_{temp}$, it follows that $|v^{l-2}|_{high} > (1 + \varepsilon')|v^l|_{low}$. Now, from property (3) of
Proposition 5.7, we have that $|v^l|_{low} \ge |v^l|_{high} - 2\varepsilon'|v^l|_{low}/3 \ge (1 - 2\varepsilon'/3)|v^l|_{high}$. Therefore we have that
$|v^{l-2}|_{high} > (1 + \varepsilon')(1 - 2\varepsilon'/3)|v^l|_{high}$.

By successive applications, we obtain $|v^{l-6}|_{high} > (1 + \varepsilon')^3 (1 - 2\varepsilon'/3)^3 |v^l|_{high}$. Now, as $|v^l|_{high} \ge |v^l|$
and $|v^l| \ge |v^l|_{low} \ge (1 - 2\varepsilon'/3)|v^l|_{high}$, we have: $|v^{l-6}|/(1 - 2\varepsilon'/3) > (1 + \varepsilon')^3 (1 - 2\varepsilon'/3)^3 |v^l|$. Equivalently,
$|v^{l-6}| > (1 + \varepsilon')^3 (1 - 2\varepsilon'/3)^4 |v^l|$.

Thus, the size of the suffix decomposition is at most $6 \log_{(1+\varepsilon')^3 (1 - 2\varepsilon'/3)^4} |v| \le 6 \log |v| / \log(1 + \varepsilon'/3 + O(\varepsilon'^2))$
$\le 144(\log |v|)(\log n)/\varepsilon + O(\log n)$. □

Using the tester from Theorem 4.9 for computing each $R$, we can then prove the robustness lemma.

Lemma 5.8 (Robustness lemma). Let $\mathcal{A}$ be a VPA recognizing $L$ and let $u \in \Sigma^n$. Let $R_{final}$ be the final value
of $R_{temp}$ in Algorithm 5 using the tester from Theorem 4.9 at lines 14 and 20 of Algorithm 2. If $u \in L$,
then $R_{final} \in L$; and if $R_{final} \in L$, then $\mathrm{bdist}_\Sigma(u, L) \le \varepsilon n$ with probability at least $1 - \eta$.

Proof. One way is easy. A direct inspection reveals that each substitution of a factor $w$ by a relation $R$
enlarges the set of possible $w$-transitions. Therefore $R_{final} \in L$ when $u \in L$.

For the other way, consider some word $u$ such that $R_{final} \in L$. Since the tester of Theorem 4.9 has
bounded error $\eta' = \eta/n$ and was called at most $n$ times, none of the calls fails with probability at least
$1 - \eta$. From now on we assume that we are in this situation.

Let $h = \mathrm{Depth}(R_{final})$. We will inductively construct sequences $u_0 = u, \ldots, u_h = R_{final}$ and
$v_h = R_{final}, \ldots, v_0$ such that for every $0 \le l \le h$, $u_l, v_l \in (\Sigma_+ \cup \Sigma_- \cup \Sigma_Q)^*$, $\mathrm{bdist}_\Sigma(u_l, v_l) \le 3(h - l)\varepsilon'|u_l|$
and $v_l \in L$. Furthermore, each word $u_l$ will be the word $u$ with some substitutions of factors by relations
$R$ computed by the tester. Therefore, $\mathrm{Depth}(u_l)$ is well defined and will satisfy $\mathrm{Depth}(u_l) = l$. This will
conclude the proof using that $\mathrm{Depth}(R_{final}) \le \log_{3/2} n$ from Lemma 3.4. This will give us $\mathrm{bdist}_\Sigma(u_0, v_0) \le
6\varepsilon' n \log n \le \varepsilon n$.

We first define the sequence $(u_l)_l$ (see Figure 5 for an illustration). Starting from $u_0 = u$, let $u_{l+1}$ be the
word $u_l$ where some factors in $\Lambda_Q$ have been replaced by a $(3\varepsilon', \Sigma)$-approximation in $\Sigma_Q$. These correspond
to all the approximations eventually performed by the algorithm that did not involve a symbol already in
$\Sigma_Q$. Observe that after this collapse, the symbol is still a $(3\varepsilon', \Sigma)$-approximation. In particular, $u_h = R_{final}$,
$u_l \in (\Sigma_+ \cup \Sigma_- \cup \Sigma_Q)^*$ and $\mathrm{Depth}(u_l) = l$ by construction.

We now define the sequence $(v_l)_l$ such that $v_l \in L$. Each letter of $v_l$ will be annotated by an accepting
run of states for $\mathcal{A}$. Set $v_h = R_{final}$ with an accepting run from $p_{in}$ to $q_f$ for some $(p_{in}, q_f) \in R_{final} \cap
(Q_{in} \times Q_f)$. Consider now some level $l < h$. Then $v_l$ is simply $v_{l+1}$ where some letters $R \in \Sigma_Q$ in common
with $u_{l+1}$ are replaced by some factors $w \in (\Lambda_Q)^*$ as explained in the next paragraph. Those letters are
the ones that are present in $u_{l+1}$ but not in $u_l$, and are still present in $v_{l+1}$ (i.e. they have not been further
approximated down the chain from $u_{l+1}$ to $u_h$, or deleted by edit operations moving up from $v_h$ to $v_{l+1}$).

Figure 5: Constructing the words $u_0$, $u_1$ and $u_2$ as in Lemma 5.8, where $\mathrm{Depth}(R_{final}) = 2$.

Let $w \in (\Lambda_Q)^*$ be one of those factors and $R \in \Sigma_Q$ its respective $(3\varepsilon', \Sigma)$-approximation. By hypothesis
$R$ is still in $v_{l+1}$ and corresponds to a transition $(p, q)$ of the accepting run of $v_{l+1}$. We replace $R$ by a factor
$w'$ such that $p \xrightarrow{w'} q$ and $\mathrm{bdist}_\Sigma(w, w') \le 3\varepsilon'|w|$, and annotate $w'$ accordingly. By construction, the resulting
word $v_l$ satisfies $v_l \in L$ and $\mathrm{bdist}_\Sigma(u_l, v_l) \le 3(h - l)\varepsilon'|u_l|$. □

A A Tester for Weighted Regular Languages

We design a non-adaptive property tester for weighted regular languages that serves as a basic routine of our
main algorithm. Property testing of regular languages was first considered in [1] for the Hamming distance
and we adapt this tester to weighted words for the simple case of edit distance. Such a property tester has
already been constructed, first for edit distance in ll^, and later on for weighted words in [25], with an
approach based on [1].

In this work, we take an alternative approach that we believe simpler, but slightly less efficient than the
tester of [25]. We consider the graph of components of the automaton and focus on paths in this graph; we
however introduce a new criterion, $\kappa$-saturation (for some parameter $0 < \kappa < 1$), that significantly
simplifies the correctness proof of the tester compared to the ones in [1] and in [25]. In particular Lemma IA3]
makes it possible to design a non-adaptive tester for $L$ and also to approximate the action of $u$ on $\mathcal{A}$ as follows.

Definition A.1. Let $\Sigma' \subseteq \Sigma$ and $R \subseteq Q \times Q$. Then $R$ $(\varepsilon, \Sigma')$-approximates a word $u$ on $\mathcal{A}$ (or simply
$\varepsilon$-approximates when $\Sigma' = \Sigma$), if for all $p, q \in Q$: (1) $(p, q) \in R$ when $p \xrightarrow{u} q$; (2) $u$ is $(\varepsilon, \Sigma')$-close to some
word $v$ satisfying $p \xrightarrow{v} q$ when $(p, q) \in R$.

Our main contribution is the following one.

Theorem A.2. Let $\mathcal{A}$ be an automaton with $m \ge 2$ states and diameter $d \ge 2$. Let $\varepsilon > 0$, $\eta > 0$,
$t \ge 2\lceil 2dm^3(\log 1/\eta)/\varepsilon \rceil$ and $k \ge \lceil 2dm/\varepsilon \rceil$. There is an algorithm that, given $t$ random factors $v_1, \ldots, v_t$ of
some weighted word $u$, such that each $v_i$ comes from an independent $k$-factor sampling on $u$, outputs a set
$R \subseteq Q \times Q$ that $\varepsilon$-approximates $u$ on $\mathcal{A}$ with one-sided error $\eta$.

This is still true with any combination of the following generalizations:

• The algorithm is given an over-sampling of each of the factors $v_i$ instead.

• When $\mathcal{A}$ is $\Sigma'$-closed, and $d$ is the $\Sigma'$-diameter of $\mathcal{A}$, then $R$ also $(\varepsilon, \Sigma')$-approximates $u$ on $\mathcal{A}$.

The rest of this section is devoted to the proof of Theorem A.2, and therefore we fix a regular language
$L$ recognized by some finite state automaton $\mathcal{A}$ on $\Sigma$ with a set of states $Q$ of size $m \ge 2$, and a diameter
$d \ge 2$. Define the directed graph $G_\mathcal{A}$ on vertex set $Q$ whose edges are pairs $(p, q)$ when $p \xrightarrow{a} q$ for some
$a \in \Sigma$.

A component $C$ of $G_\mathcal{A}$ is a maximal subset (w.r.t. inclusion) of vertices of $G_\mathcal{A}$ such that for every $p_1, p_2$
in $C$ one has a path in $G_\mathcal{A}$ from $p_1$ to $p_2$. The graph of components $\mathcal{G}_\mathcal{A}$ of $G_\mathcal{A}$ describes the transition
relation of $\mathcal{A}$ on components of $G_\mathcal{A}$: its vertices are the components and there is a directed edge $(C_1, C_2)$ if
there is an edge of $G_\mathcal{A}$ from a vertex in $C_1$ toward a vertex in $C_2$.
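The graph of components is the usual condensation of a directed graph into its strongly connected components. The Python sketch below computes it by a naive reachability fixpoint, which is enough for the small state sets considered here. The function names are hypothetical, `edges` is assumed to be a list or set of pairs (not a one-shot iterator), and a single state is treated as its own component through the empty path.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

def components(states: Iterable[int], edges: List[Tuple[int, int]]) -> Set[FrozenSet[int]]:
    """Group states that can reach each other (strongly connected components)."""
    states = list(states)
    reach: Dict[int, Set[int]] = {p: {p} for p in states}
    changed = True
    while changed:                      # propagate reachability until a fixpoint
        changed = False
        for p, q in edges:
            before = len(reach[p])
            reach[p] |= reach[q]
            changed = changed or len(reach[p]) != before
    return {frozenset(q for q in reach[p] if p in reach[q]) for p in states}

def component_graph(states: Iterable[int], edges: List[Tuple[int, int]]):
    """Return the components and the edges (C1, C2) between distinct components."""
    states = list(states)
    comps = components(states, edges)
    owner = {p: c for c in comps for p in c}
    comp_edges = {(owner[p], owner[q]) for p, q in edges if owner[p] != owner[q]}
    return comps, comp_edges
```

For the edges 0→1, 1→0, 1→2, 2→3, 3→2, this yields the two components {0, 1} and {2, 3} with a single edge from the first to the second.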

Definition A.3. Let $C$ be a component of $G_\mathcal{A}$, and let $\Pi = (C_1, \ldots, C_l)$ be a path in $\mathcal{G}_\mathcal{A}$.

• A word $u$ is $C$-compatible if there are states $p, q \in C$ such that $p \xrightarrow{u} q$.

• A word $u$ is $\Pi$-compatible if $u$ can be partitioned into $u = v_1 a_1 v_2 \cdots a_{l-1} v_l$ such that $p_i \xrightarrow{v_i} q_i$ and
$q_i \xrightarrow{a_i} p_{i+1}$, where $v_i$ is a factor, $a_i$ a letter, and $p_i, q_i \in C_i$.

• A sequence of factors $(v_1, \ldots, v_t)$ of a word $u$ is $\Pi$-compatible if they are factors of another
$\Pi$-compatible word with the same relative order and same overlap.

Note that the above properties are easy to check. Indeed, $C$-compatibility is a reachability property
while the two others easily follow from $C$-compatibility checking.
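The reachability check behind $C$-compatibility can be sketched as a standard subset simulation started from all states of $C$. In this Python sketch the transition relation is assumed to be given as a set of triples (p, a, q), and the function name `is_c_compatible` is hypothetical.

```python
from typing import Iterable, Set, Tuple

def is_c_compatible(delta: Set[Tuple[int, str, int]], C: Set[int],
                    word: Iterable[str]) -> bool:
    """Check whether some run on `word` starts and ends in the component C."""
    current = set(C)                    # all possible starting states p in C
    for a in word:
        current = {q for (p, b, q) in delta if p in current and b == a}
        if not current:                 # no run can read this prefix
            return False
    return bool(current & set(C))       # some run ends in a state q in C
```

Since a run that leaves a component can never come back to it, no explicit restriction of the intermediate states to $C$ is needed.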

We now give a criterion that characterizes those words $u$ that are $\varepsilon$-far to every $\Pi$-compatible word. Note
that it will not be used in the tester that we design in Theorem A.2 for weighted regular languages, but only
in Lemma Ia 3], which is the key tool to prove its correctness.

For a component $C$ and a $C$-incompatible word $v$, let $v_1 \cdot a$ be the shortest $C$-incompatible prefix of $v$.
We define and denote the $C$-cut of $v$ as $v = v_1 \cdot a \cdot v_2$. When $v_1$ is not the empty word, we say that $v_1$ is a
$C$-factor and $a$ is a $C$-separator for $v_1$, otherwise we say that $a$ is a strong $C$-separator.

Fix a path $\Pi = (C_1, \ldots, C_l)$ in $\mathcal{G}_\mathcal{A}$, a parameter $0 < \kappa < 1$, and consider a weighted word $u$. We
define a natural partition of $u$ according to $\Pi$, that we call the $\Pi$-partition of $u$. For this, start with the first
component $C = C_1$, and consider the $C_1$-cut $u_1 \cdot a \cdot u_2$ of $u$. Next, we inductively continue this process with
either the suffix $a \cdot u_2$ if $a$ is a $C_1$-separator, or the suffix $u_2$ if $a$ is a strong $C_1$-separator. Based on some
criterion defined below we will move from the current component $C_i$ to a next component $C_j$ of $\Pi$, where
most often $j = i + 1$, until the full word $u$ is processed. If we reach $j = l + 1$, we say that $u$ $\kappa$-saturates
$\Pi$ and the process stops. We now explain how we move on in $\Pi$. We stay within $C_i$ as long as both the
number of $C_i$-factors and the total weight of strong $C_i$-separators are at most $\kappa|u|$ each. Then, we continue
the decomposition with some fresh counting and using a new component $C_j$ selected as follows. One sets
$j = i + 1$ except when the transition is the consequence of a strong $C_i$-separator $a$ of weight greater than
$\kappa|u|$, that we call a heavy strong separator. In that case only, one lets $j \ge i + 1$, if it exists, be the minimal
integer such that $q \xrightarrow{a} q'$ with $q \in C_{j-1} \cup C_j$ and $q' \in C_j$, and $j = l + 1$ otherwise.

Proposition A.4. Let $0 < \kappa \le \varepsilon/(2dl)$. If $u$ is $\varepsilon$-far to every $\Pi$-compatible word, then $u$ $\kappa$-saturates $\Pi$. 

Proof. The proof is by contraposition. We assume that $u$ does not $\kappa$-saturate $\Pi$, and we correct $u$ to 
a $\Pi$-compatible word as follows. 

First, we delete each strong separator of weight less than $\kappa|u|$. Their total weight is at most $2l\kappa|u|$. 
Because $u$ does not saturate, each strong separator of weight larger than $\kappa|u|$ fits in the $\Pi$-partition, and does 
not need to be deleted. 

We now have a sequence of consecutive $C_i$-factors and of heavy strong $C_i$-separators, for some $1 \le i \le 
l$, in an order compatible with $\Pi$. However, the word is not yet compatible with $\Pi$, since each factor may 
end with a state different from the first state of the next factor. For each such pair, there is nevertheless a path 
connecting them. We can therefore bridge all factors by inserting, between each pair, a factor of weight at most $d$, the diameter 
of $\mathcal{A}$. 

The resulting word is then $\Pi$-compatible by construction, and the total cost of the edit operations is at 
most $(2l + dl)\kappa|u| \le \varepsilon|u|$, since $d \ge 2$. □ 
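
For completeness, the last inequality can be spelled out as follows, using only the hypothesis $\kappa \le \varepsilon/(2dl)$ of the proposition and $d \ge 2$:

```latex
\[
  (2l + dl)\,\kappa\,|u|
  \;\le\; (2l + dl)\,\frac{\varepsilon}{2dl}\,|u|
  \;=\; \frac{2 + d}{2d}\,\varepsilon\,|u|
  \;\le\; \varepsilon\,|u| ,
\]
% The last step holds because 2 + d <= 2d is equivalent to d >= 2.
```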

For a weighted word $u$, we recall that the $k$-factor sampling on $u$ is defined in Section 2. The following 
lemma is the key lemma for the tester for weighted regular languages. 

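Lemma A.5. Let $u$ be a weighted word, and let $\Pi = C_1 \ldots C_l$ be a path in $\mathcal{G}_{\mathcal{A}}$. Let $0 < \kappa \le \varepsilon/(2dl)$ and let $W$ 
denote the $\lceil 2/\kappa \rceil$-factor sampling on $u$. Then for every $0 < \eta < 1$ and $t \ge 2l(\log 1/\eta)/\kappa$, the probability 
$P(u, \Pi) = \Pr[(v_1, \ldots, v_t) \text{ is } \Pi\text{-compatible}]$, taken over $t$ independent samples $v_1, \ldots, v_t$ from $W$, satisfies $P(u, \Pi) = 1$ when $u$ is $\Pi$-compatible, 
and $P(u, \Pi) \le \eta$ when $u$ is $\varepsilon$-far from being $\Pi$-compatible. 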

Proof. The first part of the lemma is immediate. For the second part, assume that $u$ is $\varepsilon$-far from any 
$\Pi$-compatible word. For simplicity we assume that $2/\kappa$ and $\kappa|u|/2$ are integers. We first partition $u$ according 
to $\Pi$ and $\kappa$. Then, Proposition A.4 tells us that $u$ $\kappa$-saturates $\Pi$. For each $C_i$, we have three possible cases. 

1. There are $\kappa|u|$ disjoint $C_i$-factors in $u$. Since they have total weight at most $|u|$, there are at least 
$\kappa|u|/2$ of them whose weight is at most $2/\kappa$ each. Since each letter has weight at least 1, the total 
weight of the first letters of each of those factors is at least $\kappa|u|/2$. Therefore one of them together 
with its $C_i$-separator is a sub-factor of some sampled factor $v_j$ with probability at least $1 - (1 - \kappa/2)^t$. 

2. The total weight of strong $C_i$-separators of $u$ is at least $\kappa|u|$. Therefore one of them is the first letter 
of some sampled factor $v_j$ with probability at least $1 - (1 - \kappa)^t$. 

3. There is neither a $C_i$-factor nor a $C_i$-separator in $u$, because of a strong $C_{i'}$-separator of weight 
greater than $\kappa|u|$, for some $i' < i$. This separator is the first letter of some sampled factor $v_j$ with 
probability at least $1 - (1 - \kappa)^t$. 
By a union bound, the probability that one of the above-mentioned samples fails to occur is at most 
$l(1 - \kappa/2)^t \le \eta$. We assume now that they all occur, and we show that they form a $\Pi$-incompatible sequence. 
For each $i$, let $w_i$ be the above-described sub-factor of those samples. Each $w_i$ appears in $u$ after $w_{i-1}$ or, 
in the case of a strong separator of heavy weight, $w_i = w_{i-1}$. Moreover, each factor $w_i$ which is distinct 
from $w_{i-1}$ forces the next factors to start from some component $C_{i'}$ with $i' > i$. As a result, $(w_1, \ldots, w_l)$ is not 
$\Pi$-compatible, and as a consequence neither is $(v_1, \ldots, v_t)$, which gives the result. □ 
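
As a numeric sanity check of the union bound on one set of illustrative values, the sketch below evaluates the sample count and the failure bound from the proof. It assumes natural logarithms and takes $t = \lceil 2l \ln(1/\eta)/\kappa \rceil$; the function names and the chosen values of $l$, $\eta$, $\kappa$ are ours and are not taken from the paper.

```python
import math


def sample_count(l, eta, kappa):
    """Smallest integer t with t >= 2 * l * ln(1/eta) / kappa (natural log assumed)."""
    return math.ceil(2 * l * math.log(1 / eta) / kappa)


def failure_bound(l, kappa, t):
    """Union bound l * (1 - kappa/2)^t over the l components of the path."""
    return l * (1 - kappa / 2) ** t


# Illustrative values only.
l, eta, kappa = 3, 0.1, 0.05
t = sample_count(l, eta, kappa)
assert failure_bound(l, kappa, t) <= eta
```

The check relies on $(1 - \kappa/2)^t \le e^{-\kappa t/2}$, so that the bound is at most $l\,\eta^{l}$ for this choice of $t$.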

We can now conclude with the proof of Theorem A.2. 
Proof of Theorem A.2. The algorithm is very simple: 

1. Set $R = \emptyset$ 

2. For all states $p, q \in Q$ 

(a) Check if factors $v_1, \ldots, v_t$ could come from a word $v$ such that $p \xrightarrow{v} q$ 
// Step (a) is done using the graph $\mathcal{G}_{\mathcal{A}}$ of connected components of $\mathcal{A}$ 

(b) If yes, then add $(p, q)$ to $R$ 

3. Return $R$ 
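
The loop above can be sketched under a strong simplifying assumption that is ours, not the paper's: the sampled factors are pairwise disjoint, listed in their order of appearance, and may be separated by arbitrary (possibly empty) gap words. Step (a) then reduces to alternating free reachability in the automaton with reading each factor. The sketch works directly on a hypothetical transition table `delta` rather than on the graph of components, and ignores weights and overlaps.

```python
def read(delta, states, word):
    """States reachable from `states` by reading `word` letter by letter."""
    for a in word:
        states = {q for p in states for q in delta.get((p, a), ())}
    return states


def reachable(delta, alphabet, sources):
    """States reachable from `sources` by reading any word, including the empty one."""
    seen, stack = set(sources), list(sources)
    while stack:
        p = stack.pop()
        for a in alphabet:
            for q in delta.get((p, a), ()):
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
    return seen


def candidate_pairs(delta, alphabet, states, factors):
    """Pairs (p, q) such that the factors fit, in order, on some run from p to q."""
    R = set()
    for p in states:
        current = {p}
        for v in factors:
            # Arbitrary gap word first, then the sampled factor itself.
            current = read(delta, reachable(delta, alphabet, current), v)
        for q in reachable(delta, alphabet, current):
            R.add((p, q))
    return R
```

Instead of testing each pair $(p, q)$ separately, this version computes, for each $p$, all admissible $q$ at once; the resulting set is the same under the stated assumption.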

It is clear that this $R$ contains every $(p, q)$ such that $p \xrightarrow{u} q$. Now for the converse, we will show that, 
with bounded error $\eta$, the output set $R$ only contains pairs $(p, q)$ such that there exists a path $\Pi = C_1, \ldots, C_l$ 
on $\mathcal{G}_{\mathcal{A}}$ such that $p \in C_1$, $q \in C_l$, and $u$ is $\Pi$-compatible. In that case, there is an $\varepsilon$-close word $v$ satisfying 
$p \xrightarrow{v} q$. 

Indeed, using $l \le m$ and Lemma A.5 with $t$, $\kappa = \varepsilon/(2dm)$ and $\eta' = \eta/2^m$, the samples satisfy 
$P(u, \Pi) \le \eta/2^m$ when $u$ is not $\Pi$-compatible. Therefore, we can conclude using a union bound argument 
on all possible paths on $\mathcal{G}_{\mathcal{A}}$, of which there are at most $2^m$, that, with probability at least $1 - \eta$, there is 
no $\Pi$ such that the samples are $\Pi$-compatible but $u$ is not $\Pi$-compatible. 

The structure of the tester is such that it can only become more likely to reject a word that is not $\Pi$-compatible 
when it is given an over-sampling as input instead. Words $u$ such that $p \xrightarrow{u} q$ will always be accepted, whatever the 
number and length of the samples. Therefore the theorem still holds with an over-sampling. 

Last, $\mathcal{A}$ being $\Sigma'$-closed ensures that the notions of compatibility and saturation remain unchanged. 
Using the $\Sigma'$-diameter in Lemma A.5 (and therefore in Proposition A.4) lets us use bridges in $\Sigma'^*$ instead of 
$\Sigma^*$ with weight at most $d$. □ 